Pentaho: kill all running transformations if any one of them fails

I have a wrapper job which runs 4 transformations in parallel. I want to kill all four transformations if any one of the running transformations fails.
If it were a wrapper transformation, there would be a possibility of error handling by setting the condition ExecutionNrErrors > 0.
If I add an Abort job entry after each of these transformations, it will kill the other transformations, but they end with a green tick instead of a red tick.
How do we achieve this in Pentaho Jobs?

I guess you are looking for a solution like this:
It won't work, even if no transformation fails. The rule in Pentaho Data Integration is to start a job entry as soon as possible, so the Success or Failure step will start as soon as any one of the transformations finishes.
You are warned of this fact when you specify that the transformations should run in parallel.
If you want the transformations to run in parallel, you have to define a locking mechanism yourself. You can also replace your main job with a transformation in which everything runs in parallel, and use a Blocking Step to wait for all the transformations to finish.
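If you are able to drive the four transformations from your own Java wrapper instead of a wrapper job, a kill-all watchdog is straightforward to write against the Kettle API. A minimal sketch, assuming the PDI libraries are on the classpath and using placeholder .ktr file names:

    import java.util.ArrayList;
    import java.util.List;
    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class ParallelKillAll {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // Placeholder file names -- replace with your four transformations.
            String[] files = {"t1.ktr", "t2.ktr", "t3.ktr", "t4.ktr"};
            List<Trans> running = new ArrayList<>();

            // Start all four transformations in parallel (execute() is non-blocking).
            for (String file : files) {
                Trans trans = new Trans(new TransMeta(file));
                trans.execute(null);
                running.add(trans);
            }

            // Poll; as soon as one transformation reports errors, stop all of them.
            boolean failed = false;
            while (!failed && running.stream().anyMatch(t -> !t.isFinished())) {
                for (Trans trans : running) {
                    if (trans.getErrors() > 0) {
                        failed = true;
                        running.forEach(Trans::stopAll); // kill the others
                        break;
                    }
                }
                Thread.sleep(500);
            }

            for (Trans trans : running) {
                trans.waitUntilFinished();
            }
            // Exit non-zero so the caller sees a failure (red tick), not success.
            System.exit(failed ? 1 : 0);
        }
    }

This keeps the failure visible to the caller: the wrapper exits with a non-zero code instead of reporting the stopped transformations as successful.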


ADF Dataflows: do I have any control or influence over cluster startup time? (NOT "TTL")

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Acquiring compute from an existing "warm" cluster (i.e. one which has been kept alive using TTL) for a new dataflow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per source day).
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes can be quite a substantial overhead. (I have a pipeline where fully half the execution time is spent waiting for those 1-2 minute 'Acquire Compute' startups.)
Do I have any control at all over how long startup takes in each case? Is there anything that I can do to speed up the startup, or anything that I should avoid so as not to make things even worse?
There's a new feature in town to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.

How to run Pentaho transformations in parallel and limit executors count

The task is to run a defined number of transformations (.ktr) in parallel.
Each transformation opens its own database connection to read data.
But we have a limitation on the given user, who is allowed only 5 parallel connections to the DB, and let's consider that this cannot be changed.
So when I start the job depicted below, only 5 transformations finish their work successfully, and the other 5 fail with a DB connection error.
I know that there is an option to redraw the job scheme to have only 5 parallel sequences, but I don't like this approach, as it requires reimplementation whenever the thread count changes.
Is it possible to configure some kind of pool of executors, so that the Pentaho job understands that even though 10 transformations were provided, only 5 of them (whichever 5) may be processed in parallel?
I am assuming that you know the number of parallel database connections available. If you know this, use a Switch/Case step and route to that number of transformations in parallel. A second option is to use the Job Executor: in the Job Executor you can set a variable which in turn calls the job accordingly. For example, you call a job using the Job Executor with the value
c:/data-integrator/parallel_job_${5}.kjb where 5 is the number of connections available
or
c:/data-integrator/parallel_job_${7}.kjb where 7 is the number of connections available.
Does this make sense to you?
The concept is the following:
Catch the database connection error when the transformation attempts to run.
Wait a couple of seconds.
Retry the run of the transformation.
Look at the attached transformation picture. It works for me.
Disadvantages:
A lot of connection errors in the logs, which can be confusing.
The given solution could turn into an infinite loop (but it could be amended to avoid that).
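If you are in a position to launch the transformations from Java rather than from a wrapper job, another option is a fixed-size executor pool: submit all 10 transformations, and only 5 ever run (and therefore hold a DB connection) at any moment, while the rest wait in the queue instead of failing to connect. A sketch under those assumptions, with hypothetical file names, using the Kettle API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class BoundedTransPool {
        private static final int MAX_PARALLEL = 5; // the allowed DB connection count

        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // At most MAX_PARALLEL transformations run (and hold a connection) at once;
            // the remaining ones queue up instead of failing with a connection error.
            ExecutorService pool = Executors.newFixedThreadPool(MAX_PARALLEL);

            for (int i = 1; i <= 10; i++) {
                final String file = "transformation_" + i + ".ktr"; // hypothetical names
                pool.submit(() -> {
                    Trans trans = new Trans(new TransMeta(file));
                    trans.execute(null);
                    trans.waitUntilFinished();
                    return trans.getErrors();
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

Changing the pool size is then a one-line edit, so no job scheme needs to be redrawn when the connection limit changes.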

Is it possible to schedule a retry for a SQL job, instead of its step components, whenever one of the steps fails?

I am using SQL Server 2014 and I have a SQL job (an SSIS package which contains 11 steps) which has been scheduled to run on a daily basis at a specific time.
I know one can schedule each step to attempt a retry whenever that step fails. However, is there a way to configure a retry for the whole SQL job whenever the job fails at any step during the process? That is, if say, the job fails at Step 8, the whole job is run again from Step 1.
The tidiest solution I can think of would be to create an error-handling step in your job which is executed when any other step fails (change the On Failure action of all other steps to jump to this one), and to manage the job's schedule so that it triggers again on the following minute, after the job ends. This way you will see the execution history of the job in the Agent.
You will have to keep recurrent failures in mind; I doubt you want the job to repeat itself indefinitely.
To configure the job to trigger, you can add a schedule that fires every minute and enable/disable it when necessary. The job won't fire if it's already running.

Stop executing a pipeline transform while other pipeline transforms keep running

I have a number of files in Google Storage which I have to write to multiple tables in BigQuery after applying a simple ParDo transform, all of which I am trying to execute using a single pipeline. So basically I have a number of parallel, unconnected sources and sinks running in a single pipeline in one Dataflow job.
In the ParDo transform I have a condition which, if it evaluates to true, means that writing to the particular BigQuery table (transform) has to stop, while writes to the other BigQuery tables (other transforms) continue as usual.
In this image there are 2 parallel sources and 2 parallel sinks. Because of some bad data in the source for date 2014-08-01, the first transform failed, and once it failed, the 2014-08-02 transform got cancelled. The 2014-08-02 transform had no bad data.
Is there a way to prevent the cancellation of the other transform?
Currently in the Dataflow service, an entire pipeline will either succeed or fail, and any failure will cancel the rest of the pipeline. There's no way to change this behavior; you need to run separate pipelines if you want to have them succeed or fail separately.
Note that operationally, you can run both pipelines from the same Java main program; just create two different Pipeline objects and invoke run() on them separately.
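For example, the structure would look roughly like the sketch below (shown here with the Apache Beam Java SDK; the original Dataflow SDK is analogous). The actual reads, ParDos, and BigQuery writes are elided placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class TwoIndependentPipelines {
        public static void main(String[] args) {
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

            // Two separate Pipeline objects become two separate Dataflow jobs.
            Pipeline p1 = Pipeline.create(options);
            // p1.apply(...): read the 2014-08-01 files, ParDo, write to BigQuery
            Pipeline p2 = Pipeline.create(options);
            // p2.apply(...): read the 2014-08-02 files, ParDo, write to BigQuery

            // Run them independently: a failure in one job does not cancel the other.
            p1.run();
            p2.run();
        }
    }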

SSIS 2005 Control Flow Priority

The short version is that I am looking for a way to prioritize certain tasks in SSIS 2005 control flows. That is, I want to be able to set it up so that Task B does not start until Task A has started, but Task B does not need to wait for Task A to complete. The goal is to reduce the amount of time where I have idle threads hanging around waiting for Task A to complete so that they can move on to Tasks C, D & E.
The issue I am dealing with is converting a data warehouse load from a linear job that calls a bunch of SPs to an SSIS package calling the same SPs but running multiple threads in parallel. So basically I have a bunch of Execute SQL Task and Sequence Container objects with Precedence Constraints mapping out the dependencies. So far no problems; things are working great and it cut our load time a bunch.
However, I noticed that tasks with no downstream dependencies are commonly being sequenced before those that do have dependencies. This is causing a lot of idle time in certain spots that I would like to minimize.
For example: I have about 60 procs involved in this load; ~10 of them have no dependencies at all and can run at any time. Then I have another one with no upstream dependencies, but almost every other task in the job is dependent on it. I would like to make sure that the task the others depend on is running before I pick up any of the tasks with no dependencies. This is just one example; there are similar situations in other spots as well.
Any ideas?
I am late in updating over here, but I also raised this issue on the MSDN forums and we were able to devise a partial workaround. See here for the full thread, or here for the feature request asking Microsoft to give us a way to do this cleanly...
The short version is that you use a series of Boolean variables to control loops that act like roadblocks and prevent the flow from reaching the lower-priority tasks until the higher-priority items have started.
The steps involved are:
Declare a Boolean variable for each of the high-priority tasks and default the values to false.
Create an OnPreExecute event handler for each of the high-priority tasks.
In the event handler, create a Script Task which sets the appropriate Boolean to true.
At each choke point, insert a For Loop container that loops while the appropriate Boolean(s) are false. (I have a script with a 1-second sleep inside each loop, but it also works with empty loops.)
If done properly, this gives you a tool where, at each choke point, the package has some number of high-priority tasks ready to run and a blocking loop that keeps it from proceeding down the lower-priority branches until said high-priority items are running. Once all of the high-priority tasks have been started, the loop clears and allows any remaining threads to move on to lower-priority tasks. The worst case is one thread sitting in the loop while waiting for other threads to come along and pick up the high-priority tasks.
The major drawback to this approach is the risk of deadlocking the package if too many blocking loops get queued up at the same time, or if you misread your dependencies and have loops waiting for tasks that never start. Careful analysis is needed to decide which items deserve higher priority and where exactly to insert the blocks.
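The pattern is easier to see outside SSIS. Below is the same roadblock idea expressed in plain Java, purely as an illustration (this is not SSIS code): each high-priority task trips a flag the moment it starts, and the low-priority branch blocks until every flag has been tripped.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PriorityRoadblock {
        public static void main(String[] args) {
            // One latch count per high-priority task, like the Boolean variables.
            CountDownLatch highPriorityStarted = new CountDownLatch(2);
            ExecutorService pool = Executors.newFixedThreadPool(4);

            Runnable highPriority = () -> {
                highPriorityStarted.countDown(); // the "pre-execute event": flag set at start
                doWork("high-priority task");    // the long-running work continues after
            };

            Runnable lowPriority = () -> {
                try {
                    // The "blocking loop": wait until every high-priority task has started.
                    highPriorityStarted.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                doWork("low-priority task");
            };

            pool.submit(highPriority);
            pool.submit(highPriority);
            pool.submit(lowPriority);
            pool.submit(lowPriority);
            pool.shutdown();
        }

        private static void doWork(String name) {
            System.out.println(name + " running on " + Thread.currentThread().getName());
            try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
        }
    }

Note that the same deadlock caveat applies: if the latch count never reaches zero because a high-priority task never starts, the low-priority branch waits forever.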
I don't know any elegant ways to do this, but my first shot would be something like this:
Put the proc that has to run first in a Sequence Container. In that same Sequence Container, put a Script Task that just waits 5-10 seconds or so before each of the 10 independent steps can run. Then chain the rest of the procs below that Sequence Container.