How to run multiple data flow tasks in a single control flow in an SSIS project - sql

I have an issue: I run multiple data flow tasks in a single control flow. If all 5 sources are alive it works fine, but if any one source is dead, the remaining 4 source flows are not executed. How can I make whichever sources are alive run smoothly whenever we execute the job?

I am assuming that the 5 data flows are all connected together on the control flow. The desire is to have all 5 data flows execute, regardless of success or failure of the previous data flow.
To accomplish this, you will need to change the Precedence Constraint from the current value of Success (green) to Completion (blue). To do so, double-click the connector line between two data flow tasks to open the Precedence Constraint Editor and set its Value to Completion.

Another option is to place all 5 data flow tasks in a sequence container and have them run in parallel (by not connecting them to each other).

Related

Azure Data Factory: Execute Pipeline activity cannot reference calling pipeline, cyclical behaviour required

I have a number of pipelines that need to cycle depending on the availability of data: if the data is not there, wait and try again. The pipeline behaviour is largely controlled by a database that captures logs, which are used to make decisions about processing.
I read the Microsoft documentation about the Execute Pipeline activity which states that
The Execute Pipeline activity allows a Data Factory or Synapse pipeline to invoke another pipeline.
It does not explicitly state that a pipeline invoking itself is impossible, though. I tried to reference Pipe_A from Pipe_A, but the pipe is not visible in the drop-down. I need a work-around for this restriction.
Constraints:
The pipe must not call all pipes again, just the pipe in question. The preceding pipe is running all pipes in parallel.
I don't know how many iterations are needed and cannot specify this quantity.
As far as possible, best effort has been implemented, and this pattern should continue.
Ideas:
1. Create an intermediary pipe that can be referenced. This is no good: I would need to do this for every pipe that requires this behaviour, because dynamic content is not allowed for pipe selection. This approach would also pollute the Data Factory workspace.
2. Direct control flow backwards, after waiting, inside the same pipeline if a condition is met. This won't work either; the If activity does not allow expression of flow within the same context as the If activity itself.
3. Externalise this behaviour to a Python application, which could be attached to an Azure Function if needed. The application would handle the scheduling and waiting; it could call any pipe it needed and could itself be invoked by the pipe in question. This seems drastic!
4. Finally, I discovered the Until activity, which has do-while behaviour. I could wrap these pipes in an Until: the pipe executes, finishes, and sets the database state to 'finished', or it cannot finish, sets the state to incomplete, and waits. The expression then either kicks off another execution or it does not. Additional conditional logic can be included, as required, in the procedure used to set the value of the variable used by the expression in the Until. I would need a variable per pipe.
I think idea 4 makes sense, but I thought I would post this anyway in case people can spot limitations in this approach and/or recommend a better one.
Yes, I absolutely agree with All About BI; it seems that in your scenario the best-suited ADF activity is Until:
The Until activity in ADF functions as a wrapper and parent component for iterations, with inner child activities comprising the block of items to iterate over. The result(s) from those inner child activities must then be used in the parent Until expression to determine if another iteration is necessary.
The assessment condition for the Until activity might comprise outputs from other activities, pipeline parameters, or variables.
When used in conjunction with the Wait activity, the Until activity allows you to create loop conditions to periodically check the status of specific operations. Here are some examples:
Check to see if the database table has been updated with new rows.
Check to see if the SQL job is complete.
Check to see whether any new files have been added to a specific folder.
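To make the loop condition concrete, the Until expression can evaluate the result of a Lookup (or Stored Procedure) activity that queries the control database described in the question. Below is a minimal sketch of such a status check; the pipeline_log table and its pipeline_name and status columns are hypothetical placeholders for whatever your logging schema actually uses.

-- Minimal sketch: status check polled on each Until iteration.
-- pipeline_log, pipeline_name and status are assumed names; replace them
-- with the tables/columns your control database really uses.
SELECT CASE WHEN status = 'finished' THEN 1 ELSE 0 END AS is_done
FROM   pipeline_log
WHERE  pipeline_name = 'Pipe_A';

A Lookup activity (say, named CheckStatus) runs this query on each iteration, a Wait activity pauses between checks, and the Until expression, for example @equals(activity('CheckStatus').output.firstRow.is_done, 1), decides whether another pass is needed.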

Mule is taking a long time for a simple select on the first execution

I am just using an HTTP Listener and a Select in a Mule flow. It is a GET method, passing an ID as input, and the same ID is passed to the select as an input parameter. There is a delay of 3 to 4 minutes when we execute it via Mule for the first time, but in the database the query takes only milliseconds.
This delay only happens after adding the parameter to the select.
Can someone help me understand why there is a delay the first time and how to resolve it?
A possible cause could be how you created the metadata. For example, if you used a huge CSV file as the sample for your data structure, Mule reads the whole file to obtain the headers, and that takes time.
Solution: if you create metadata by example, use small samples with a couple of rows of data.
Usually the main points that cause performance issues in first executions are:
JVM and Mule Runtime warmup
Time to establish connections
The first one cannot be avoided. For the second one, a connection pool is usually used to mitigate it somewhat. Having said that, 4 minutes is a very excessive time for either of those. You need to do some performance analysis: add logs before and after the operation in the flow, enable debug logs for the database connector, and even attach a Java profiler to the Mule JVM to understand what could be happening.
You also have to consider whether there is a high number of records to be processed; even if the database can answer quickly, it might take some time to format the output.

How to run Pentaho transformations in parallel and limit the executor count

The task is to run defined number of transformations (.ktr) in parallel.
Each transformation opens its own database connection to read data.
But we have a limitation on the given user, who is allowed only 5 parallel connections to the DB, and let's assume that this cannot be changed.
So when I start the job depicted below, only 5 transformations finish their work successfully, and the other 5 fail with a DB connection error.
I know that there is an option to redraw the job scheme to have only 5 parallel sequences, but I don't like this approach, as it requires reimplementation whenever the thread count changes.
Is it possible to configure some kind of pool of executors, so that the Pentaho job will understand that even if 10 transformations are provided, only 5 of them (any 5) may be processed in parallel?
I am assuming that you know the number of parallel database connections available. If you know this, use a switch/case component to select the number of transformations to run in parallel. A second option is to use the Job Executor step: you can set a variable which in turn calls the job accordingly. For example, you call a job using the Job Executor with the value
c:/data-integrator/parallel_job_${5}.kjb where 5 is number of connections available
or
c:/data-integrator/parallel_job_${7}.kjb where 7 is number of connections available
Does this make sense to you?
The concept is the following:
Catch the database connection error during the transformation run attempt
Wait a couple of seconds
Retry the transformation run
Look at the attached transformation picture. It works for me.
Disadvantages:
A lot of connection errors in the logs, which could be confusing.
The given solution could turn into an infinite loop (but it could be amended to avoid that).

Pentaho Logging: specify Job or Trans for each line

I am running Pentaho Kettle 6.1 through a java application. All of the Pentaho logs are directed through the java app and logged out into the same log file at the java level.
When a job starts or finishes the logs indicate which job is starting or finishing, but when the job is in the middle of running the log output only indicates the specific step it is on without any indication of which job or trans is executing.
This causes confusion and is difficult to follow when there is more than one job running simultaneously. Does anyone know of a way to prepend the name of the job or trans to each log entry?
Not that I know of, and I doubt there is, for the simple reason that the same transformation/job may be split to run on more than one machine, by more than one user, and/or launched in parallel in different job hierarchies of callers.
The general answer is to log to a database (right-click anywhere, Parameters, Logging, then define the logging table and what you want to log). All the logging will be copied to a database table together with a channel_id. This is a unique number attributed to each run, and it links together all the logging information that comes from all the dependent jobs/transformations. You can then view this info with a SELECT ... WHERE channel_id = ...
However, your case seems to be simpler. Use database logging with a log interval of, say, 2 seconds and run SELECT TRANSNAME/JOBNAME, LOG_FIELD FROM LOG_TABLE continuously in your terminal, as sketched below.
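As a rough illustration of that monitoring query, assuming the logging table sits in SQL Server and uses the default Kettle column names (TRANSNAME, CHANNEL_ID, LOG_FIELD, LOG_DATE), something like the following could be run repeatedly; adjust the names to match whatever you configured in the Logging tab.

-- Minimal sketch: poll the Kettle logging table for recent log lines.
-- Table and column names are assumptions; match them to your logging setup.
SELECT TRANSNAME,   -- which transformation produced the line
       CHANNEL_ID,  -- unique id linking all lines of one run
       LOG_FIELD    -- the actual log text
FROM   LOG_TABLE
WHERE  LOG_DATE > DATEADD(MINUTE, -10, GETDATE())  -- only the last 10 minutes
ORDER  BY LOG_DATE, CHANNEL_ID;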
You can also follow a specific job/transformation by logging in a specific table, but this means you know in advance which is the job/transformation to debug.

SSIS - Connection Management Within a Loop

I have the following SSIS package:
(package screenshot: http://www.freeimagehosting.net/uploads/5161bb571d.jpg)
The problem is that within the Foreach loop a connection is opened and closed for each iteration.
On running SQL Profiler I see a series of:
Audit Login
RPC:Completed
Audit Logout
The duration for the login and the RPC that actually does the work is minimal. However, the duration for the logout is significant, running into several seconds each. This causes the job to run very slowly, taking many hours. I get the same problem when running either on a test server or a stand-alone laptop.
Could anyone please suggest how I may change the package to improve performance?
Also, I have noticed that when running the package from Visual Studio, it looks as though it continues to run, with the component blocks going amber then green, even though all the processing has been completed and SQL Profiler has gone silent. Why is that?
Thanks,
Rob.
Have you tried running your data flow tasks in parallel rather than serially? You can most likely break up your for-loops so that each 'set' runs in parallel; while it might still be expensive to log in and out, you will be doing it N times simultaneously.
SQL Server is most performant when running a batch of operations in a single query. Is it possible to redesign your package so that it batches updates in a single call, rather than having a procedural workflow with for-loops, as you have it here?
If the design of your application and the RPC permits (or can be refactored to permit it), this might be the best solution for performance.
For example, instead of something like:
for each Facility
for each Stock
update Qty
See if you can create a structure (using SQL, or a bulk update RPC with a single connection) like:
update Qty
from Qty join Stock join Facility
...
If you control the implementation of the RPC, the RPC could maintain the same API (if needed) by delegating to another procedure that performs the batch operation but with a single-record restriction (WHERE record = someRecord).
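As an illustration of that set-based shape, a single UPDATE with joins can replace the nested loops. The sketch below uses hypothetical tables Stock(FacilityId, StockId, Qty) and a staging table StagingQty(FacilityId, StockId, NewQty) loaded once with all the values the loop would otherwise apply row by row; substitute your real schema.

-- Minimal sketch (T-SQL): one batched update instead of per-row RPC calls.
-- Stock and StagingQty are assumed table names for illustration only.
UPDATE s
SET    s.Qty = st.NewQty
FROM   Stock AS s
JOIN   StagingQty AS st
  ON   st.FacilityId = s.FacilityId
 AND   st.StockId    = s.StockId;

Loading the staging table in one bulk insert and then issuing this single statement keeps the work inside one connection and one batch, which is where SQL Server performs best.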
Have you tried doing the following?
In your connection managers, for the connection that is used within the loop, right-click and choose Properties. In the properties for the connection, find "RetainSameConnection" and change it from the default of False to True. This will let your package maintain the connection throughout the package run. Your profiler trace would then probably look like:
Audit Login
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
...
Audit Logout
With the final Audit Logout happening at the end of package execution.