How to run Pentaho transformations in parallel and limit the executor count

The task is to run a defined number of transformations (.ktr) in parallel.
Each transformation opens its own database connection to read data.
However, the database user is limited to 5 parallel connections, and let's assume that this cannot be changed.
So when I start the job depicted below, only 5 transformations finish successfully, and the other 5 fail with a database connection error.
I know I could redraw the job to have only 5 parallel sequences, but I don't like this approach, because it requires reimplementation whenever the thread count changes.
Is it possible to configure some kind of executor pool, so that the Pentaho job understands that even if 10 transformations are provided, only 5 of them (any 5) may run in parallel at a time?

I am assuming that you know the number of parallel database connections available. If so, one option is to use a Switch/Case component and then run that number of transformations in parallel. A second option is to use a Job Executor: set a variable that determines which job is called. For example, you call a job using the Job Executor with the value
c:/data-integrator/parallel_job_${5}.kjb where 5 is the number of connections available
or
c:/data-integrator/parallel_job_${7}.kjb where 7 is the number of connections available
Does this make sense to you?
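
To make the "pool of executors" idea from the question concrete, here is a minimal sketch written outside Pentaho, in plain Python. It is not a Pentaho feature or setting. The names run_all and fake_transformation are made up for illustration, and the fake task only sleeps instead of launching a real .ktr. The point is that the worker count, not the drawing of the job, caps how many tasks hold a connection at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 5  # the per-user connection limit described in the question


def run_all(transformations, run_one, max_parallel=MAX_CONNECTIONS):
    """Run every transformation, never more than max_parallel at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(run_one, transformations))


# Hypothetical stand-in for launching a .ktr; it only records concurrency.
_lock = threading.Lock()
_active = 0
peak_active = 0


def fake_transformation(name):
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.05)  # pretend to hold a database connection for a while
    with _lock:
        _active -= 1
    return f"{name}: ok"


results = run_all([f"t{i}.ktr" for i in range(10)], fake_transformation)
```

Changing the limit then means changing one number rather than redrawing the job.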

The concept is the following:
Catch the database connection error when an attempt to run a transformation fails
Wait a couple of seconds
Retry the transformation
Look at the attached transformation picture. It works for me.
Disadvantages:
A lot of connection errors in the logs, which could be confusing.
The given solution could turn into an infinite loop (but it could be amended to avoid that)
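
The catch, wait, retry idea can be sketched in a few lines of Python, including a cap on attempts so it cannot loop forever. This only illustrates the logic and is not the Pentaho transformation from the picture. ConnectionLimitError, run_with_retry and flaky are hypothetical names, and the attempt limit and the 2 second wait are assumptions.

```python
import time


class ConnectionLimitError(Exception):
    """Hypothetical stand-in for the database's connection error."""


def run_with_retry(run_once, max_attempts=10, wait_seconds=2.0, sleep=time.sleep):
    """Call run_once; on a connection error wait and retry, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_once()
        except ConnectionLimitError:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            sleep(wait_seconds)


# Example: a task that fails twice before a connection becomes free.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionLimitError("no free connection")
    return "done"


waits = []  # record the waits instead of really sleeping
outcome = run_with_retry(flaky, max_attempts=5, wait_seconds=2.0, sleep=waits.append)
```

The failed attempts still show up as errors in the database log, so the first disadvantage above remains.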

Related

Mule is taking a long time for the simple select for the first execution

I am just using an HTTP listener and a Select in a Mule flow. It is a GET method that takes an ID as input, and the same ID is passed to the Select as an input parameter. The first execution through Mule takes 3 to 4 minutes, although the same query takes only milliseconds when run directly in the database.
This delay only happens after adding the parameter to the Select.
Can someone help me understand why there is a delay the first time and how to resolve it?
A possible cause is how you create the metadata. For example, if you use a huge CSV file as the example for your data structure, Mule reads the whole file to get the headers, which takes time.
Solution: if you create metadata by example, use small examples with a couple of rows of data.
Usually the main points that cause performance issues in first executions are:
JVM and Mule Runtime warmup
Time to establish connections
The first one cannot be avoided. For the second one, a connection pool is usually used to mitigate it somewhat. Having said that, 4 minutes is a very excessive time for either of those. You need to do some performance analysis: add logs before and after each operation in the flow, enable debug logs for the database connector, and even connect a Java profiler to the Mule JVM to understand what could be happening.
You also have to consider whether there is a high number of records to process: even if the database can answer quickly, it might take some time to format them.
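
As an illustration of "logs before and after each operation", here is a small Python sketch that times connection setup, the parameterized select, and result formatting separately. It is not Mule code. The in-memory SQLite database, the items table and the timed helper are assumptions made only so the example is runnable. In a Mule flow the equivalent would be logger steps around the Select.

```python
import sqlite3
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("connect"):
    conn = sqlite3.connect(":memory:")  # stand-in for the real database

conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'example')")

with timed("select"):
    row = conn.execute("SELECT name FROM items WHERE id = ?", (1,)).fetchone()

with timed("format"):
    payload = {"name": row[0]}

conn.close()
```

Comparing the three numbers for the first and second request shows which phase the delay belongs to.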

Run SQL statements asynchronously

I have six SQL UPDATE statements (on the same database) as an SQL Agent job that I run every night to ensure that two systems are in sync with each other. Each update takes about 10 minutes to run.
As a test today I opened SQL Server Management Studio, opened five windows, and ran the five updates concurrently (I can guarantee that a row can only ever be updated by one SQL statement). The five queries ran in 15 minutes.
Therefore instead of using a single SQL Agent Job I am thinking about calling the SQL statements from a VB.NET program, so that I can either:
1) Use asynchronous calls to ensure the queries are running concurrently.
2) Use multiple threads to ensure the queries are running concurrently
I read an article recently that says that asynchronous calls should not be used to speed up processing performance. Therefore I believe that multiple threads is the answer. Is that correct?
I think either what you read is wrong, or you've misinterpreted it. Running things concurrently will not speed up any individual operation, but it allows more things to happen in parallel by freeing up threads (on Windows, threads are expensive to create, so creating and destroying them over short periods should be avoided).
Concurrency (e.g. using .NET 4.5.1's async support) allows the activity, including starting other asynchronous actions, to continue while the thread is used for something else.
The details of how to do this depend on how you are accessing the database: Entity Framework (EF), ADO.NET, or something else?
With EF you can use the extension methods in QueryableExtensions like ToListAsync on queries.
With ADO.NET you can use SqlCommand methods like ExecuteNonQueryAsync and ExecuteReaderAsync.
Since you are dealing with SQL statements, the choice you make in VB.NET will not affect the performance or the time required to complete the 5 tasks on the SQL side.
If you make 5 async calls, you will sit waiting for 5 answers; if you spawn 5 threads, those threads will sit waiting for their synchronous calls to finish. The net result will be the same.
I am also in favor of the 5 agent jobs: it is a solution that leverages existing SQL tools, does not require additional coding (more coding = more maintenance), and is available out of the box on almost any SQL instance.
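
The point that five async calls and five threads end up waiting on the same work can be shown with a short Python sketch (Python rather than VB.NET, purely for brevity). The "statement" here is simulated with a sleep, and the function names are made up for illustration. No database is involved.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

STATEMENT_SECONDS = 0.05  # simulated server-side run time of one UPDATE


def run_statement_blocking(n):
    time.sleep(STATEMENT_SECONDS)  # thread sits waiting on the "server"
    return n


async def run_statement_async(n):
    await asyncio.sleep(STATEMENT_SECONDS)  # same wait, without blocking a thread
    return n


def with_threads(count):
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(run_statement_blocking, range(count)))


async def with_async(count):
    return await asyncio.gather(*(run_statement_async(n) for n in range(count)))


start = time.perf_counter()
thread_results = with_threads(5)
thread_elapsed = time.perf_counter() - start

start = time.perf_counter()
async_results = asyncio.run(with_async(5))
async_elapsed = time.perf_counter() - start
```

In both variants the elapsed time is governed by how long the simulated statements take, not by the client-side mechanism used to wait for them.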

How to run multiple tasks in single data flow task in SSIS project

I have an issue: I run multiple data flow tasks in a single control flow. If all 5 sources are alive it works fine, but if any one source is dead, the remaining 4 source flows are not executed. How can I make sure that whichever sources are alive run smoothly whenever we execute the job?
I am assuming that the 5 data flows are all connected together on the control flow. The desire is to have all 5 data flows execute, regardless of success or failure of the previous data flow.
To accomplish this, you will need to change the Precedence Constraint from the current value of Success (green) to Completion (blue). To access the Precedence Constraint Editor, double-click the connector line.
Another option is to place all 5 data flow tasks in a sequence container and have them run in parallel (by not connecting them to each other).
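
The difference between the Success and Completion constraints can be modeled with a toy Python sketch. This is not SSIS code. run_chain, ok and dead_source are hypothetical names, and the chain is a simple list of tasks run in order.

```python
def run_chain(tasks, constraint="success"):
    """Run tasks in order and return {name: 'ok' | 'failed' | 'skipped'}.

    constraint="success": a failure blocks every later task.
    constraint="completion": later tasks run whether or not earlier ones failed.
    """
    status = {}
    blocked = False
    for name, task in tasks:
        if blocked:
            status[name] = "skipped"
            continue
        try:
            task()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
            if constraint == "success":
                blocked = True
    return status


def ok():
    pass


def dead_source():
    raise ConnectionError("source unavailable")


tasks = [("flow1", ok), ("flow2", dead_source), ("flow3", ok)]
on_success = run_chain(tasks, "success")
on_completion = run_chain(tasks, "completion")
```

With "success", flow3 is skipped after flow2 fails. With "completion", flow3 still runs.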

SSIS 2005 Control Flow Priority

The short version is I am looking for a way to prioritize certain tasks in SSIS 2005 control flows. That is, I want to be able to set it up so that Task B does not start until Task A has started, but Task B does not need to wait for Task A to complete. The goal is to reduce the amount of time where I have idle threads hanging around waiting for Task A to complete so that they can move on to Tasks C, D & E.
The issue I am dealing with is converting a data warehouse load from a linear job that calls a bunch of SPs to an SSIS package calling the same SPs but running multiple threads in parallel. So basically I have a bunch of Execute SQL Task and Sequence Container objects with Precedence Constraints mapping out the dependencies. So far no problems, things are working great and it cut our load time a bunch.
However I noticed that tasks with no downstream dependencies are commonly being sequenced before those that do have dependencies. This is causing a lot of idle time in certain spots that I would like to minimize.
For example: I have about 60 procs involved with this load, ~10 of them have no dependencies at all and can run at any time. Then I have another one with no upstream dependencies but almost every other task in the job is dependent on it. I would like to make sure that the task with the dependencies is running before I pick up any of the tasks with no dependencies. This is just one example, there are similar situations in other spots as well.
Any ideas?
I am late in updating over here, but I also raised this issue over on the MSDN forums and we were able to devise a partial workaround. See here for the full thread, or here for the feature request asking Microsoft to give us a way to do this cleanly...
The short version is that you use a series of Boolean variables to control loops that act like roadblocks and prevent the flow from reaching the lower priority tasks until the higher priority items have started.
The steps involved are:
Declare a bool variable for each of the high priority tasks and default the values to false.
Create a pre-execute event for each of the high priority tasks.
In the pre-execute event create a script task which sets the appropriate bool to true.
At each choke point insert a for each loop that will loop while the appropriate bool(s) are false. (I have a script with a 1 second sleep inside each loop but it also works with empty loops.)
If done properly this gives you a tool where at each choke point the package has some number of high priority tasks ready to run and a blocking loop that keeps it from proceeding down the lower priority branches until said high priority items are running. Once all of the high priority tasks have been started the loop clears and allows any remaining threads to move on to lower priority tasks. Worst case is one thread sits in the loop while waiting for other threads to come along and pick up the high priority tasks.
The major drawback to this approach is the risk of deadlocking the package if too many blocking loops get queued up at the same time, or if you misread your dependencies and have loops waiting for tasks that never start. Careful analysis is needed to decide which items deserve higher priority and where exactly to insert the blocks.
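
The roadblock idea (a flag set when the high priority task starts, and a wait in front of the lower priority work) can be sketched in Python with threading.Event standing in for the Boolean variable and the blocking loop. This is an illustration only, not SSIS script code. The task names and the 5 second timeout are assumptions. The timeout is there to show one way of guarding against the deadlock risk mentioned above.

```python
import threading
import time

a_started = threading.Event()  # plays the role of the Boolean variable
order = []
order_lock = threading.Lock()


def log(message):
    with order_lock:
        order.append(message)


def task_a():
    # Equivalent of the pre-execute event: announce that A is under way.
    log("A started")
    a_started.set()
    time.sleep(0.05)  # the long-running high priority work
    log("A finished")


def task_b():
    # Roadblock: do not start until A has started (not finished).
    if not a_started.wait(timeout=5):
        raise TimeoutError("Task A never started")
    log("B started")


thread_b = threading.Thread(target=task_b)
thread_a = threading.Thread(target=task_a)
thread_b.start()  # B is picked up first but has to wait at the roadblock
thread_a.start()
thread_a.join()
thread_b.join()
```

B only waits for A to begin, so it is free to run while A is still working.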
I don't know any elegant ways to do this, but my first shot would be something like this:
Use a Sequence Container holding the proc that has to run first. In that same sequence container, put a script task that just waits 5-10 seconds or so before each of the 10 independent steps can run. Then chain the rest of the procs below that sequence container.

Database Job Scheduling

I have a procedure written in PL/Java in my PostgreSQL database that sends out updates over JMS.
What I would like to do is have that function called on an interval (every 15 seconds) internally in the database (preferably not from an outside process). Is this possible? Any ideas?
If you need no external access, you are presumably able to modify the database design so that you don't need the update at all. Can you explain more about what the update is doing?
As depesz said, you could use either cron or pgAgent, but they only go down to one-minute granularity, not 15 seconds. Sleeping inside the stored procedure until the next iteration is not a good idea either, because you would hold a transaction open for all that time.
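
If an outside process turns out to be acceptable after all, a 15 second interval can be handled by a small loop instead of cron. Below is a minimal Python sketch of such a loop. run_every, FakeClock and job are hypothetical names. In real use, job would open a connection, call the stored procedure in its own short transaction, and commit, which is an assumption about the setup and not something shown here.

```python
import time


def run_every(interval_seconds, job, iterations,
              clock=time.monotonic, sleep=time.sleep):
    """Call job at fixed intervals, compensating for how long job itself takes."""
    next_run = clock()
    for _ in range(iterations):
        job()
        next_run += interval_seconds
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)


class FakeClock:
    """Test double so the example finishes instantly instead of in 45 seconds."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
run_times = []


def job():
    run_times.append(fake.now)
    fake.now += 1.0  # pretend the database call takes one second


run_every(15, job, iterations=3, clock=fake.time, sleep=fake.sleep)
```

Because the wait happens outside the database, no transaction stays open between runs.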
Strict answer: it is not possible. Since you don't want an outside process, and PostgreSQL doesn't support jobs, you are out of luck.
If you reconsider using outside processes, then you'll most likely want something like cron, or better yet pgAgent.
On the other hand, what do you need to do that has to happen every 30 seconds? This seems like a design problem.
First, you'll spend the least amount of effort if you just go with a cron job.
However, if you were starting from scratch: you are trying to periodically replicate rows from your database. I think you are looking at a replication queue.
The PGQ project (used for Londiste replication, both from Skype's SkyTools) has a queue that you can use independently. When configuring it, you set a maximum event count, and a loop delay, before batched events are generated. You can get batches spaced by no more than 15 seconds that way. You now have to produce the events that will be batched, using a trigger that calls pgq.insert_event; and consume the queues. The consumer can call your PL/Java stored proc; you'll have to rewrite the procedure to send everything in the batch instead of scanning the base table for new events.
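
The shape of that design (a trigger that writes events into a queue, and a consumer that sends a whole batch) can be illustrated with a small Python sketch. It uses an in-memory SQLite database and a plain table as the queue purely so it is runnable. It does not use PGQ or pgq.insert_event, and the accounts table, the event_queue table and consume_batch are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE event_queue (
        event_id INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id INTEGER,
        new_balance INTEGER
    );
    -- Producer side: every update leaves an event in the queue.
    CREATE TRIGGER accounts_after_update AFTER UPDATE ON accounts
    BEGIN
        INSERT INTO event_queue (account_id, new_balance)
        VALUES (NEW.id, NEW.balance);
    END;
""")


def consume_batch(connection, send):
    """Send every queued event, then remove the processed ones."""
    rows = connection.execute(
        "SELECT event_id, account_id, new_balance "
        "FROM event_queue ORDER BY event_id"
    ).fetchall()
    for _, account_id, new_balance in rows:
        send({"account_id": account_id, "balance": new_balance})
    if rows:
        connection.execute(
            "DELETE FROM event_queue WHERE event_id <= ?", (rows[-1][0],)
        )
    connection.commit()
    return len(rows)


conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 200)")
conn.execute("UPDATE accounts SET balance = 150 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = 250 WHERE id = 2")

sent = []  # stand-in for publishing to JMS
processed = consume_batch(conn, sent.append)
remaining = conn.execute("SELECT COUNT(*) FROM event_queue").fetchone()[0]
```

The consumer works from the queued events alone and never scans the base table for changes.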
As far as I know, PostgreSQL doesn't support scheduled tasks. You'll need to use a script with cron or at (depending on your operating system).
Sounds like you're doing some sort of replication? Every 15s sounds like a lot of updates. Could you set up a trigger (or a number of triggers) instead of polling?
If you are using JMS, why not just have the task wait for input on the queue?
Per your depesz comment, you have a PL/Java stored procedure that "flushes out database tables (updates) as java objects". Since you want it to run in 15 second intervals, it must be processing a batch of updates each time. Rather than processing a batch of updates in a stored procedure every 15 seconds, why not process them one at a time when they happen, via an after update trigger, and eliminate the need for a timed interval? If you are aggregating data from multiple tables to build your objects, then add the triggers to your uppermost tables only.
In my case the problem was that the agent couldn't authenticate to the database, so after I made all connections from localhost trusted, the service started successfully and the job works fine.
For more information about the error, look in the Windows Event Viewer or the equivalent on a Unix-based system. See my config file C:\Program Files\PostgreSQL\10\data\pg_hba.conf