Combine the Output of 3 Transformations in Pentaho

I'm executing 3 transformations in parallel. The output of all three transformations contains the same column names.
I've connected the output of all transformations to a common Dummy step in the job, added a Wait for SQL step to wait until all 3 transformations have completed, and added a Unique rows step in the next transformation to remove duplicate records.
Everything works fine up to the Wait for SQL step, but when the next transformation gets the rows from the result and runs the Unique rows step, I still get duplicate records.
Does anyone have a solution for this issue?

You have to sort the resulting stream after the dummy step before removing the duplicate rows; the Unique rows step only removes consecutive duplicates, so unsorted input leaves duplicates behind. The sort also makes sure that all 3 streams have completed before the rows move on.
I didn't know you could use the Dummy step to combine stream results. I always used the Append streams step for that.

Several points:
You cannot simply combine the outputs of multiple transformations at the job level. You will need another transformation to read the data using the Get rows from result step; jobs don't know about data streams, they only know about tasks (job entries) and exit status.
Be careful with "Launch next entries in parallel" at the job level. Let's say you have 2 transformations, trans1 and trans2, launched in parallel, followed by a dummy job entry. The dummy will be called TWICE, once when trans1 finishes and again when trans2 finishes. A job hop is not a data stream, it's a workflow. If you want to run transformations in parallel and later go back to a single workflow, you need a subjob that calls the transformations and doesn't have a Success job entry. That way, the subjob only finishes after the 2nd transformation finishes, and only then does it go to the dummy step in the parent.
Why do you need those transformations running inside a job? If they have the same column structure, why don't you call them as sub-transformations inside a transformation instead of a job? Steps in a transformation are always launched in parallel, so if you're parallelizing things for performance, a transformation is the way to do it, not a job. A job is meant to run multiple tasks in sequential order, one after the other, with workflow control depending on the result of the previous entry.

If you want the output from all three of them in a single file, you could run the same 3 instances in a single transformation with the append-to-target option ticked in your output step.

Related

How to add (concatenate) variables inside batch processing in Mule 4?

I am processing records from one DB to another DB. The batch job is being called multiple times in a single request (the process API URL is triggered only one time).
How can I add up the total records processed (given by the payload at the On Complete phase) for one complete request?
For example, I ran the process and the batch job executed three times, so I want the sum of the records across all 3 batch jobs.
That's not possible because of how the Batch scope works:
In the On Complete phase, none of these variables (not even the original ones) are visible. Only the final result is available in this phase. Moreover, since the Batch Job Instance executes asynchronously from the rest of the flow, no variable set in either a Batch Step or the On Complete phase will be visible outside the Batch Scope.
source: https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#variable-propagation
What you could do is store the results in a persistent repository, for example in your database.
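As an illustration of that idea, here is a minimal JDBC sketch (the batch_counts table and the correlation-ID column are assumptions, not anything provided by Mule): each batch job instance writes its processed-record count keyed by the request's correlation ID, and the caller sums the counts once the request is done.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class BatchCountStore {
        private final String jdbcUrl; // connection string for the persistent DB (assumption)

        public BatchCountStore(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

        // Called once per batch job instance, e.g. from the On Complete phase via a DB insert.
        public void recordInstanceCount(String correlationId, long processedRecords) throws Exception {
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(
                         "INSERT INTO batch_counts (correlation_id, processed_records) VALUES (?, ?)")) {
                ps.setString(1, correlationId);
                ps.setLong(2, processedRecords);
                ps.executeUpdate();
            }
        }

        // Called once per request to get the grand total across all batch job instances.
        public long totalForRequest(String correlationId) throws Exception {
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT COALESCE(SUM(processed_records), 0) FROM batch_counts WHERE correlation_id = ?")) {
                ps.setString(1, correlationId);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }
    }

In a Mule flow the insert and the sum would typically be Database connector operations rather than hand-written JDBC, but the table layout is the same.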

SQL Agent job with different schedules

I am looking to see if it's possible to have one job that runs on different schedules, with the catch being that one of the schedules needs to pass in a parameter.
I have an executable that will run some functionality when there is no parameter, but if there is a parameter present it will run some additional logic.
Setting up my job, I created a schedule (every 15 minutes) with a step of type Operating system (CmdExec):
runApplication.exe
For the other schedule I would like it to run once per day; however, the command would need to be: runApplication.exe "1"
I don't think I can create a different step with a separate schedule, or can I?
Anyone have any ideas on how to achieve this without having two separate jobs?
There's no need for 2 jobs. What you can do is update your script so that the result of your job (your parameter) is stored in a table, then update your secondary logic to reference that table. If there's a value for that parameter, run your secondary logic; if there's no value, have your secondary logic return 0 or not run at all. All in one script.
Just make sure you either truncate the entire reference-parameter table every run, or store a date in there so you know which row to reference.
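For illustration, a minimal sketch of that table-driven check, assuming a hypothetical job_parameter table that gets a row whenever the extra logic should fire (table name, columns, and the environment-variable connection string are all made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RunApplication {
        public static void main(String[] args) throws Exception {
            // Assumed reference table: job_parameter(run_date DATE, param_value VARCHAR(10)),
            // populated only for the runs that need the additional logic.
            String jdbcUrl = System.getenv("JOB_DB_URL"); // connection string supplied by the environment (assumption)
            String paramValue = null;

            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT param_value FROM job_parameter WHERE run_date = CAST(GETDATE() AS date)");
                 ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    paramValue = rs.getString(1);
                }
            }

            runCoreLogic();                      // the functionality that always runs
            if (paramValue != null) {
                runAdditionalLogic(paramValue);  // the extra logic, only when a value was stored for today
            }
        }

        private static void runCoreLogic()                   { /* existing behaviour of runApplication.exe */ }
        private static void runAdditionalLogic(String param) { /* behaviour of runApplication.exe "1" */ }
    }

That way both schedules can call the exact same command line, and the presence of a stored value decides whether the additional logic runs.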
Good luck.

Why does BigtableIO write records one by one after a GroupBy/Combine DoFn?

Is someone aware of how bundles work within BigtableIO? Everything looks fine until one uses a GroupBy or Combine DoFn. At that point, the pipeline changes the pane of our PCollection elements from PaneInfo.NO_FIRING to PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}, and BigtableIO then outputs the following log for each record: INFO o.a.b.sdk.io.gcp.bigtable.BigtableIO - Wrote 1 records. Is the logging causing a performance issue when one has millions of records to output, or is it the fact that BigtableIO is opening and closing a writer for each record?
BigtableIO sends multiple records in a batch RPC. However, that assumes there are multiple records in the "bundle". Bundle sizes depend on a combination of the preceding step and the Dataflow framework. The problems you're seeing don't seem to be related to BigtableIO directly.
FWIW, here's the code that logs the number of records; it lives in the finishBundle() method.
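To make the bundle mechanics concrete, here is a simplified DoFn sketch (not the actual BigtableIO source) of the pattern: elements are buffered per bundle and flushed and logged once in finishBundle(), so a one-element bundle produces one "Wrote 1 records" line.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Simplified illustration of per-bundle batching; NOT the real BigtableIO writer.
    public class BatchingWriteFn extends DoFn<String, Void> {
        private static final Logger LOG = LoggerFactory.getLogger(BatchingWriteFn.class);
        private transient List<String> buffer;

        @StartBundle
        public void startBundle() {
            buffer = new ArrayList<>(); // a fresh buffer for every bundle the runner hands us
        }

        @ProcessElement
        public void processElement(@Element String record) {
            buffer.add(record); // records accumulate only within the current bundle
        }

        @FinishBundle
        public void finishBundle() {
            // flushToSink(buffer);  // placeholder for the batched RPC, one per bundle
            LOG.info("Wrote {} records", buffer.size()); // with 1-element bundles this logs "Wrote 1 records"
            buffer.clear();
        }
    }

So if the upstream GroupBy/Combine leaves the runner producing tiny bundles, the per-bundle flush degenerates into one write per record, which matches the log lines you're seeing.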

How to create a Priority queue schedule in Autosys?

Technologies available: Autosys, Informatica, Unix scripting, Database (available via Informatica)
Our batch currently works with file watchers looking for a file called "control.txt", which gets deleted when a feed starts processing. It gets recreated once the feed completes, which allows one of the waiting "control" Autosys jobs to pick up the control file and begin processing the next data feed, so feeds are processed one by one.
However, the system has grown large, some feeds have become more important than others, and we're looking at ways to improve our scheduler to prioritize some feeds over others.
With the current design of a single file deciding when the next feed runs, that can't be done, and I haven't been able to come up with a simple solution to make it happen.
Example:
1. Feed A is processing
2. Feed B, Feed C, Feed X, Feed F come in while Feed A is processing
3. Need to ensure that Feed B is processed next, even though C, X, F are ready.
4. C, X, F have a lower priority than A and B, but have the same priority as each other and can be processed in any order
A very interesting question. One thing that I can think of is to have an extra Autosys job with a shell script that copies the files in a certain order. Like:
Create a staging folder, e.g. StageFolder
Let's call your current Autosys input folder "the InputFolder"
Have Autosys monitor the staging folder and run an OrderedFileCopyScript.sh every minute whenever any file is present
OrderedFileCopyScript.sh should copy one file from StageFolder to InputFolder in the desired order, and only if InputFolder is empty (see the sketch below)
I hope I made myself clear.
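A rough sketch of the logic OrderedFileCopyScript.sh would implement, written here in Java purely for illustration (the folder paths and the filename-prefix priority scheme are assumptions; a few lines of shell would do the same job):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.Comparator;
    import java.util.Optional;
    import java.util.stream.Stream;

    // Moves exactly one staged file into the input folder, in priority order,
    // and only when the input folder is empty. Meant to run once per minute.
    public class OrderedFileCopy {
        public static void main(String[] args) throws IOException {
            Path stage = Paths.get("/data/StageFolder");   // assumption
            Path input = Paths.get("/data/InputFolder");   // assumption

            // Do nothing while a feed is still sitting in the input folder.
            try (Stream<Path> inputFiles = Files.list(input)) {
                if (inputFiles.findAny().isPresent()) {
                    return;
                }
            }

            // Pick the highest-priority staged file; here priority is assumed to be
            // encoded in the filename (e.g. "1_feedB.dat" sorts before "5_feedC.dat").
            Optional<Path> next;
            try (Stream<Path> staged = Files.list(stage)) {
                next = staged.filter(Files::isRegularFile)
                             .min(Comparator.comparing((Path p) -> p.getFileName().toString()));
            }

            if (next.isPresent()) {
                Path source = next.get();
                // Moved rather than copied so the same file is not picked up again next minute.
                Files.move(source, input.resolve(source.getFileName()),
                           StandardCopyOption.ATOMIC_MOVE);
            }
        }
    }
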
I oppose the use of Autosys for this requirement! Wrong tool!
I don't know all the details, but I'm assuming an application with the usual reference tables.
In this case you should make use of a feed reference table that includes relative priorities.
I would suggest creating (or reusing) a table that is loaded by the successor job of the file watcher:
1) Have the table contain the unprocessed files with their corresponding priorities, and then use this table to process the files based on priority.
2) Remove/archive the entries once done.
3) Have another job that works off this table and runs like a daemon with start_times/run_window.
This gives you the flexibility to deal with changes in priorities and keeps the overall design simple.
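A minimal sketch of the "pick the next feed by priority" lookup the daemon job could run (the pending_feeds table and its columns are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Hypothetical table loaded by the file watcher's successor job:
    //   pending_feeds(feed_name VARCHAR, file_name VARCHAR, priority INT, arrived_at TIMESTAMP)
    public class NextFeedPicker {
        // Returns the file for the highest-priority waiting feed, or null if nothing is queued.
        public static String pickNextFeed(String jdbcUrl) throws Exception {
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT file_name FROM pending_feeds " +
                         "ORDER BY priority, arrived_at");   // lower priority number = more important (assumption)
                 ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
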

Process Each Row in Kettle ONE AT A TIME?

I was wondering if it is possible to work on a per-row basis in Kettle?
I am trying to implement a reporting scheme which consists of a table where requests get queued for processing, and a Pentaho job that picks up the records from that table.
My job currently has 3 transformations in it:
The 1st is to get records from the queued-requests table.
The 2nd is to analyze the values on each record and come up with multiple results based on that record. For example, a user would request records of movies in the horror genre; it should then spit out the horror movies.
The 3rd is to further retrieve information about the movies, such as the year, director, etc., which is to be output to an Excel file.
This is the idea, but it's a bit challenging to do in Pentaho, as it does everything at the same time. Is there a way that I can make my job work on records one by one?
EDIT.
Just to add: I have been trying to extend the implementation from the Pentaho cookbook sample, but compared to my design it covers only steps 2 and 3.
I can't seem to make the Table input step work one row at a time.
I just made it act like the implementation in the cookbook, with some adjustments. Instead of using two transformations to gather all the necessary fields, I retrieved all the information I need in 1 transformation.
After that I copied that information to the next steps, then ran some queries to complete the information, and it is now working.
Passing parameters between transformations is a bit confusing; there are parameters to be set on the transformation itself and also on the job where the transformations live, so I kind of went guessing for some time just to make it work.