How to add (concatenate) variables inside batch processing in Mule 4?

I am processing records from one DB to another DB. The batch job is called multiple times within a single request (the process API URL is triggered only once).
How can I add up the total records processed (given by the payload in the On Complete phase) for one complete request?
For example, I ran the process and the batch job executed three times, so I want the sum of the records across all 3 batch job executions.

That's not possible because of how the Batch scope works:
In the On Complete phase, none of these variables (not even the original ones) are visible. Only the final result is available in this phase. Moreover, since the Batch Job Instance executes asynchronously from the rest of the flow, no variable set in either a Batch Step or the On Complete phase will be visible outside the Batch Scope.
source: https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#variable-propagation
What you could do is store the results in a persistent repository, for example in your database.
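As a minimal sketch of that idea, assuming a hypothetical BATCH_TOTALS table keyed by the request's correlation id: each On Complete phase adds its processed-record count to a running total, and the caller reads the sum once all batch job instances have finished. In a Mule flow you would issue the equivalent statements through the Database connector; plain JDBC is used here purely for illustration.

```java
import java.sql.*;

// Sketch only: BATCH_TOTALS and its columns are hypothetical assumptions,
// not part of Mule or of the original question.
public class BatchTotals {

    // Add the count reported by one batch job instance to the running total.
    public static void addProcessed(Connection conn, String correlationId, long processed)
            throws SQLException {
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE BATCH_TOTALS SET TOTAL_RECORDS = TOTAL_RECORDS + ? WHERE CORRELATION_ID = ?")) {
            update.setLong(1, processed);
            update.setString(2, correlationId);
            if (update.executeUpdate() == 0) {
                // No row yet for this request: create it with the first count.
                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO BATCH_TOTALS (CORRELATION_ID, TOTAL_RECORDS) VALUES (?, ?)")) {
                    insert.setString(1, correlationId);
                    insert.setLong(2, processed);
                    insert.executeUpdate();
                }
            }
        }
    }

    // After all batch job instances for the request have finished, read the sum.
    public static long totalFor(Connection conn, String correlationId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT TOTAL_RECORDS FROM BATCH_TOTALS WHERE CORRELATION_ID = ?")) {
            ps.setString(1, correlationId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}
```

Note the update-then-insert above is not race-free; a database upsert, or a persistent Object Store counter, would serve equally well.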

Related

How to set KOFAX KTM Server global variable value which will be initialized in Batch open, updated in SeparateCurrentPage & used in BatchClose?

I am trying to count a specific barcode value from Project.Document_SeparateCurrentPage and use it in Batch_Close to check whether the count is greater than 1; if it is, the batch should be sent to a specific queue with a specific priority. I used a global variable in the KTM Project Script to hold the count, initialized to 0 in Batch_Open. It worked fine through unit testing, but our automation team found that out of 20 similar batches, a few were sent to the queue that should only receive batches when the greater-than-one condition is met, even though they used only one barcode.
I googled and found that KTM Server script events do not allow sharing information across different processes (https://docshield.kofax.com/KTM/en_US/6.4.0-uuxag78yhr/help/SCRIPT/ScriptDocumentation/c_ServerScriptEvents.html). I then tried to use a batch field to hold the barcode count, but I was unable to update its value from the Project.Document_SeparateCurrentPage function using pXRootFolder.Fields.ItemByName("BatchFieldName").Text = "GreaterThanOne". The logs show that the batch reads the first page three times and then errors out.
Any links would help. Thanks in advance.
As you mentioned, the different phases of batch/document processing can execute in different processes, so global variables initialized in one event won’t necessarily be available in others. Ideally you should only use global variables if their content can be set from Application_InitializeScript or Application_InitializeBatch, because these events occur in each separate process. As you’ve found out, you shouldn’t use a global variable for your use case, because Document_SeparateCurrentPage and Batch_Close for one batch may occur in different processes, just as the same process will likely execute those events for multiple batches.
Also, you cannot set batch fields from document level events for a related reason: any number of separate processes could be processing documents of a batch in parallel, so batch level data is read-only to document events. It is a bit unintuitive, but separation is a document level event even though it seems like it is acting on the whole batch. (The three times you saw is just an error retry mechanism.)
If it meets your needs, the simplest answer might be to use a barcode locator as part of normal extraction (not just separation), and assign to a field if needed. While you cannot set batch fields from document events, you can read document data from batch events. So instead of trying to track something like a count over the course of document events, just make sure whatever data you need is saved at a document level. Then in a Batch_Close you can iterate the documents and count/calculate whatever you need. (In your case maybe the number of locator alternatives for the barcode locator, across each document.)

SQLAgent job with different schedules

I am looking to see if it's possible to have one job that runs on different schedules, with the catch being that one of the schedules needs to pass in a parameter.
I have an executable that will run some functionality when there is no parameter, but if there is a parameter present it will run some additional logic.
Setting up my job, I created a schedule (every 15 minutes) with a step of type Operating system (CmdExec):
runApplication.exe
For the other schedule I would like it to run once per day; however, the command would need to be: runApplication.exe "1"
I don't think I can create a different step with a separate schedule, or can I?
Anyone have any ideas on how to achieve this without having two separate jobs?
There's no need for two jobs. What you can do is update your script so that your parameter is stored in a table, then update your secondary logic to reference that table: if the parameter has a value, run the secondary logic; if it has no value, have the secondary logic return 0 or not run at all. All in one script.
Just make sure you either truncate the parameter table on every run or store a date in it so you know which row to reference.
Good luck.
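As a rough sketch of the read-and-branch side of that pattern (the daily schedule, or a separate step, would write the dated row), assuming a hypothetical JOB_PARAMETER table and connection string; the real runApplication executable could of course be in any language:

```java
import java.sql.*;
import java.time.LocalDate;

// Illustrative sketch only: JOB_PARAMETER, its columns, and the JDBC URL
// are assumptions, not anything defined in the question or answer.
public class RunApplication {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlserver://...;databaseName=...");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT PARAM_VALUE FROM JOB_PARAMETER WHERE PARAM_DATE = ?")) {
            ps.setDate(1, Date.valueOf(LocalDate.now())); // dated row, per the advice above
            try (ResultSet rs = ps.executeQuery()) {
                String param = rs.next() ? rs.getString(1) : null;

                runCoreLogic();                 // always runs (the 15-minute schedule)
                if (param != null && !param.isEmpty()) {
                    runAdditionalLogic(param);  // only when the daily parameter is present
                }
            }
        }
    }

    private static void runCoreLogic() { /* ... */ }
    private static void runAdditionalLogic(String param) { /* ... */ }
}
```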

Why BigtableIO writes records one by one after GroupBy/Combine DoFn?

Is anyone aware of how bundles work within BigtableIO? Everything looks fine until one uses a GroupBy or Combine DoFn. At that point, the pipeline changes the pane of our PCollection elements from PaneInfo.NO_FIRING to PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}, and BigtableIO then outputs the following log: INFO o.a.b.sdk.io.gcp.bigtable.BigtableIO - Wrote 1 records. Is the logging causing a performance issue when one has millions of records to output, or is it the fact that BigtableIO opens and closes a writer for each record?
BigtableIO sends multiple records in a batched RPC. However, that assumes there are multiple records in the "bundle". Bundle sizes depend on a combination of the preceding step and the Dataflow framework. The problem you're seeing doesn't seem to be related to BigtableIO directly.
FWIW, the logging of the number of records happens in the finishBundle() method.
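For context, a bundle-scoped writer typically buffers elements in @ProcessElement and flushes them once in @FinishBundle, which is why a one-element bundle produces a "Wrote 1 records" line. The sketch below is not BigtableIO's actual code, only a minimal illustration of that bundle lifecycle; writeBatch() is a hypothetical placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

// Minimal sketch of bundle-scoped batching in a Beam DoFn. This is NOT
// BigtableIO's implementation; it only illustrates why "Wrote 1 records"
// appears when each bundle contains a single element.
class BufferingWriteFn extends DoFn<String, Void> {

  private transient List<String> buffer;

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>(); // fresh buffer for every bundle
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element()); // accumulate elements belonging to this bundle
  }

  @FinishBundle
  public void finishBundle() {
    // writeBatch(buffer); // hypothetical batched RPC for the whole bundle
    System.out.println("Wrote " + buffer.size() + " records");
    buffer.clear();
  }
}
```

If each element downstream of the GroupBy/Combine ends up in its own bundle, every flush writes a single record, which would match the log you're seeing.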

How to create a Priority queue schedule in Autosys?

Technologies available: Autosys, Informatica, Unix scripting, Database (available via informatica)
Our batch currently works with file watchers looking for a file called "control.txt", which gets deleted when a feed starts processing. It gets recreated once the feed completes, which lets one of the waiting "control" Autosys jobs pick up the control file and begin processing the next data feed, so feeds are processed one by one.
However, the system has grown large, some feeds have become more important than others, and we're looking for ways to improve our scheduling so that certain feeds are prioritized over others.
With the current design, where a single file decides when the next feed runs, this can't be done, and I haven't been able to come up with a simple solution to make it happen.
Example:
1. Feed A is processing
2. Feed B, Feed C, Feed X, Feed F come in while Feed A is processing
3. Need to ensure that Feed B is processed next, even though C, X, F are ready.
4. C, X, and F have a lower priority than A and B, but share the same priority among themselves and can be processed in any order
A very interesting question. One thing I can think of is to have an extra Autosys job with a shell script that copies the files in a certain order, like this:
Create a staging folder, e.g. StageFolder
Let's call your current Autosys input folder "the InputFolder"
Have Autosys monitor the StageFolder and run OrderedFileCopyScript.sh every minute whenever files are present
OrderedFileCopyScript.sh should copy one file from StageFolder to InputFolder, in the desired priority order, and only if InputFolder is empty (see the sketch below)
I hope I made myself clear.
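The answer proposes a shell script; the following is only a sketch of the same pick-the-highest-priority-file-when-InputFolder-is-empty logic, written in Java for illustration. The folder paths and the priority rule (plain lexicographic ordering of file names) are assumptions you would replace with your own convention.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Illustrative stand-in for OrderedFileCopyScript.sh; paths and the
// name-based priority rule are hypothetical.
public class OrderedFileCopy {
    public static void main(String[] args) throws IOException {
        Path stage = Paths.get("/data/StageFolder");
        Path input = Paths.get("/data/InputFolder");

        // Only feed the watcher when it is idle, i.e. InputFolder is empty.
        try (Stream<Path> pending = Files.list(input)) {
            if (pending.findAny().isPresent()) {
                return;
            }
        }

        // Pick the highest-priority staged file; here priority is simply
        // lexicographic on the file name (B_... sorts before C_..., etc.).
        Optional<Path> next;
        try (Stream<Path> staged = Files.list(stage)) {
            next = staged.filter(Files::isRegularFile)
                         .min(Comparator.comparing(p -> p.getFileName().toString()));
        }

        if (next.isPresent()) {
            Path source = next.get();
            Files.move(source, input.resolve(source.getFileName()));
        }
    }
}
```

Because it only moves a file when InputFolder is empty, the existing control-file flow stays unchanged; the extra job just decides which feed gets in next.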
I oppose the use of Autosys for this requirement! Wrong tool!
I don't know all the details, but I assume an application with the usual reference tables.
In that case you should make use of a feed reference table that includes relative priorities.
I would suggest creating (or reusing) a table that is loaded by the successor job of the file watcher:
1) The table holds the unprocessed files with their corresponding priorities; use this table to process the files in priority order.
2) Remove/archive the entries once they are done.
3) Have another job run this processing like a daemon, with start_times/run_window.
This gives you the flexibility to deal with changes in priority and keeps the overall design simple.
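A minimal sketch of such a priority table, assuming a hypothetical FEED_QUEUE table with FEED_NAME, PRIORITY, and STATUS columns (the actual DDL and the jobs that call this are yours to define):

```java
import java.sql.*;

// Sketch only: FEED_QUEUE and its columns are assumptions for illustration.
public class FeedQueue {
    private final Connection conn;

    public FeedQueue(Connection conn) {
        this.conn = conn;
    }

    // Called by the successor job of the file watcher: register a new feed.
    public void enqueue(String feedName, int priority) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO FEED_QUEUE (FEED_NAME, PRIORITY, STATUS) VALUES (?, ?, 'PENDING')")) {
            ps.setString(1, feedName);
            ps.setInt(2, priority);
            ps.executeUpdate();
        }
    }

    // Called by the daemon-style job: pick the highest-priority pending feed.
    public String nextFeed() throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT FEED_NAME FROM FEED_QUEUE WHERE STATUS = 'PENDING' ORDER BY PRIORITY, FEED_NAME");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }

    // Archive the entry once the feed has been processed.
    public void markDone(String feedName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE FEED_QUEUE SET STATUS = 'DONE' WHERE FEED_NAME = ?")) {
            ps.setString(1, feedName);
            ps.executeUpdate();
        }
    }
}
```

The successor job of the file watcher would call enqueue(), and the daemon-style job would loop on nextFeed()/markDone().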

Combine the Output of 3 Transformations in Pentaho

I'm executing 3 transformations in parallel. The output of all three transformations contains the same column names.
I've connected the output of all the transformations to a common Dummy step in the job, added a Wait for SQL step to wait until all 3 transformations have completed execution, and added a Unique rows step in the next transformation to remove duplicate records.
Everything works properly up to the Wait for SQL step, but when the next transformation gets rows from the result and runs the Unique rows step, I still get duplicate records.
Does anyone have a solution for this issue?
You have to sort the resulting stream after the dummy step before removing the duplicate rows (Unique rows only removes consecutive duplicates). The sort also ensures that all 3 streams have completed before the rows are deduplicated.
I didn't know you could use the Dummy step to combine stream results; I always used the Append streams step for that.
Several points:
You cannot simply combine the outputs of multiple transformations at the job level. You will need another transformation to read the data using the Get rows from result step; jobs don't know about data streams, they only know about tasks (job entries) and exit statuses.
Be careful with "Launch next entries in parallel" at the job level. Let's say you have 2 transformations, trans1 and trans2, launched in parallel, followed by a dummy entry. The dummy will be called TWICE, once when trans1 finishes and again when trans2 finishes. A job hop is not a data stream, it's a workflow. If you want to run transformations in parallel and later go back to a single workflow, you need a subjob that calls the transformations and doesn't have a Success job entry. That way, the subjob only finishes after the second transformation finishes, and only then does the parent job move on to the dummy entry.
Why do you need those transformations running inside a job at all? If they have the same column structure, why not call them as sub-transformations inside a transformation instead of a job? Steps in a transformation are always launched in parallel, so if you're parallelizing things for performance, a transformation is the way to do it, not a job. A job is meant to run multiple tasks in sequential order, one after the other, with workflow control depending on the result of the previous entry.
If you want the output from all three of them in a single file, you could run the same 3 instances in a single transformation with the append-to-target option ticked in your output step.