SSIS 2005 Control Flow Priority - sql-server-2005

The short version is I am looking for a way to prioritize certain tasks in SSIS 2005 control flows. That is I want to be able to set it up so that Task B does not start until Task A has started but Task B does not need to wait for Task A to complete. The goal is to reduce the amount of time where I have idle threads hanging around waiting for Task A to complete so that they can move onto Tasks C, D & E.
The issue I am dealing with is converting a data warehouse load from a linear job that calls a bunch of SPs to an SSIS package calling the same SPs but running multiple threads in parallel. So basically I have a bunch of Execute SQL Task and Sequence Container objects with Precedent Constraints mapping out the dependencies. So far no problems, things are working great and it cut our load time a bunch.
However I noticed that tasks with no downstream dependencies are commonly being sequenced before those that do have dependencies. This is causing a lot of idle time in certain spots that I would like to minimize.
For example: I have about 60 procs involved with this load, ~10 of them have no dependencies at all and can run at any time. Then I have another one with no upstream dependencies but almost every other task in the job is dependent on it. I would like to make sure that the task with the dependencies is running before I pick up any of the tasks with no dependencies. This is just one example, there are similar situations in other spots as well.
Any ideas?

I am late in updating over here but I also raised this issue over on the MSDN forums and we were able to devise a partial work around. See here for the full thread, or here for the feature request asking microsoft to give us a way to do this cleanly...
The short version is that you use a series of Boolean variables to control loops that act like roadblocks and prevent the flow from reaching the lower priority tasks until the higher priority items have started.
The steps involved are:
Declare a bool variable for each of the high priority tasks and default the values to false.
Create a pre-execute event for each of the high priority tasks.
In the pre-execute event create a script task which sets the appropriate bool to true.
At each choke point insert a for each loop that will loop while the appropriate bool(s) are false. (I have a script with a 1 second sleep inside each loop but it also works with empty loops.)
If done properly this gives you a tool where at each choke point the package has some number of high priority tasks ready to run and a blocking loop that keeps it from proceeding down the lower priority branches until said high priority items are running. Once all of the high priority tasks have been started the loop clears and allows any remaining threads to move on to lower priority tasks. Worst case is one thread sits in the loop while waiting for other threads to come along and pick up the high priority tasks.
The major drawback to this approach is the risk of deadlocking the package if you have too many blocking loops get queued up at the same time, or misread your dependencies and have loops waiting for tasks that never start. Careful analysis is needed to decide which items deserved higher priority and where exactly to insert the blocks.

I don't know any elegant ways to do this but my first shot would be something like this..
Sequence Container with the proc that has to run first. In that same sequence container put a script task that just waits 5-10 seconds or so before each of the 10 independent steps can run. Then chain the rest of the procs below that sequence container.

Related

How to solve this scheduling problem with identical workers and task dependencies?

The problem I'm trying to solve, can be expressed as:
A team of N astronauts preparing for the re-entry of their space shuttle
into the atmosphere. There is a set of tasks that must be accomplished by the team before
re-entry. Each task must be carried out by exactly one astronaut, and certain tasks can not
be started until other tasks are completed. Which tasks should be performed by which
astronaut, and in which order, to ensure that the entire set of tasks is accomplished as
quickly as possible?
So:
Each astronaut is able to preform every tasks
Some tasks might depend on each other (e.g. task i must be completed before task j)
Tasks do not have a specific start time or deadline
The objective is to minimize the makespan (time it takes to complete all tasks)
I have found solutions to similar problems (e.g. Job Shop problem, RCPSP, etc) but non of those problems completely captures the problem described above, as they don't involve worker allocation to the tasks, i.e. the solution assumes specific workers must work on specific tasks.

How to run Pentaho transformations in parallel and limit executors count

The task is to run defined number of transformations (.ktr) in parallel.
Each transformation opens it's own database connection to read data.
But we have a limitation on given user, who has only 5 allowed parallel connection to DB and let's consider that this could not be changed.
So when I start job depicted below, only 5 transformations finish their work successfully, and other 5 fails with db connection error.
I know that there is an option to redraw job scheme to have only 5 parallel sequences, but I don't like this approach, as it requires reimplementation when count of threads changes.
Is it possible to configure some kind of pool of executors, so Pentaho job will understand that even if there were 10 transformations provided, only random 5 could be processed in parallel?
I am assuming that you know the number of parallel database connections available. If you know this, use switch/case component and then number of transformations in parallel. Second option is to use job-executor.In Job Executor, if you can set variable which in turn call the job accordingly. For example, you are calling a job using job-executor with value
c:/data-integrator/parallel_job_${5}.kjb where 5 is number of connections available
or
c:/data-integrator/parallel_job_${7}.kjb where 7 is number of connections available
Is this making sense to you.
The concept is the following:
Catch database connection error during transformation run attempt
Wait a couple of seconds
Retry run of a transformation
Look at attached transformation picture. It works for me.
Disadvantages:
A lot of connection errors in the logs, which could confuse.
Given solution could turn in infinite loop (but could be amended to avoid it)

Distributed workers that ensure a single instance of a task is running

I need to design a distributed system a scheduler sends tasks to workers in multiple nodes. Each task is assigned an id, and it could be executed more than once, scheduled by the scheduler (usually once per hour).
My only requirement is that a task with a specific id should not be executed twice at the same time by the cluster. I can think of a design where the scheduler holds a lock for each task id and sends the task to an appropriate worker. Once the worker has finished the lock should be released and the scheduler might schedule it again.
What should my design include to ensure this. I'm concerned about cases where a task is sent to a worker which starts the task but then fails to inform the scheduler about it.
What would be the best practice in this scenario to ensure that only a single instance of a job is always executed at a time?
You could use a solution that implements a consensus protocol. Say - for example - that all your nodes in the cluster can communicate using the Raft protocol. As such, whenever a node X would want to start working on a task Y it would attempt to commit a message X starts working on Y. Once such messages are committed to the log, all the nodes will see all the messages in the log in the same order.
When node X finishes or aborts the task it would attempt to commit X no longer works on Y so that another node can start/continue working on it.
It could happen that two nodes (X and Z) may try to commit their start messages concurrently, and the log would then look something like this:
...
N-1: ...
N+0: "X starts working on Y"
...
N+k: "Z starts working on Y"
...
But since there is no X no longer works on Y message between the N+0 and N+k entry, every node (including Z) would know that Z must not start the work on Y.
The only remaining problem would be if node X got partitioned from the cluster before it can attempt to commit its X no longer works on Y for which I believe there is no perfect solution.
A work-around could be that X would try to periodically commit a message X still works on Y at time T and if no such message was committed to the log for some threshold duration, the cluster would assume that no one is working on that task anymore.
With this work-around however, you'd be allowing the possibility that two or more nodes will work on the same task (the partitioned node X and some new node that picks up the task after the timeout).
After some thorough search, I came to the conclusion that this problem can be solved through a method called fencing.
In essence, when you suspect that a node (worker) failed, the only way to ensure that it will not corrupt the rest of the system is to provide a fence that will stop the node from accessing the shared resource you need to protect. That must be a radical method like resetting the machine that runs the failed process or setup a firewall rule that will prevent the process from accessing the shared resource. Once the fence is in place, then you can safely break the lock that was being held by the failed process and start a new process.
Another possibility is to use a relational database to store task metadata + proper isolation level (can't go wrong with serializable if performance is not your #1 priority).
SERIALIZABLE
This isolation level specifies that all transactions occur in a completely isolated fashion; i.e., as if all transactions in the system had executed serially, one after the other. The DBMS may execute two or more transactions at the same time only if the illusion of serial execution can be maintained.
Use either optimistic or pessimistic locking should work too. https://learning-notes.mistermicheels.com/data/sql/optimistic-pessimistic-locking-sql/
In case you need a rerun of the task, simply update the metadata. (or I would recommend to create a new task with different metadata to keep track of its execution history)

How to create a distributed 'debounce' task to drain a Redis List?

I have the following usecase: multiple clients push to a shared Redis List. A separate worker process should drain this list (process and delete). Wait/multi-exec is in place to make sure, this goes smoothly.
For performance reasons I don't want to call the 'drain'-process right away, but after x milliseconds, starting from the moment the first client pushes to the (then empty) list.
This is akin to a distributed underscore/lodash debounce function, for which the timer starts to run the moment the first item comes in (i.e.: 'leading' instead of 'trailing')
I'm looking for the best way to do this reliably in a fault tolerant way.
Currently I'm leaning to the following method:
Use Redis Set with the NX and px method. This allows:
to only set a value (a mutex) to a dedicated keyspace, if it doesn't yet exist. This is what the nx argument is used for
expires the key after x milliseconds. This is what the px argument is used for
This command returns 1 if the value could be set, meaning no value did previously exist. It returns 0 otherwise. A 1 means the current client is the first client to run the process since the Redis List was drained. Therefore,
this client puts a job on a distributed queue which is scheduled to run in x milliseconds.
After x milliseconds, the worker to receive the job starts the process of draining the list.
This works on paper, but feels a bit complicated. Any other ways to make this work in a distributed fault-tolerant way?
Btw: Redis and a distributed queue are already in place so I don't consider it an extra burden to use it for this issue.
Sorry for that, but normal response would require a bunch of text/theory. Because your good question you've already written a good answer :)
First of all we should define the terms. The 'debounce' in terms of underscore/lodash should be learned under the David Corbacho’s article explanation:
Debounce: Think of it as "grouping multiple events in one". Imagine that you go home, enter in the elevator, doors are closing... and suddenly your neighbor appears in the hall and tries to jump on the elevator. Be polite! and open the doors for him: you are debouncing the elevator departure. Consider that the same situation can happen again with a third person, and so on... probably delaying the departure several minutes.
Throttle: Think of it as a valve, it regulates the flow of the executions. We can determine the maximum number of times a function can be called in certain time. So in the elevator analogy you are polite enough to let people in for 10 secs, but once that delay passes, you must go!
Your are asking about debounce sinse first element would be pushed to list:
So that, by analogy with the elevator. Elevator should go up after 10 minutes after the lift came first person. It does not matter how many people crammed into the elevator more.
In case of distributed fault-tolerant system this should be viewed as a set of requirements:
Processing of the new list must begin within X time, after inserting the first element (ie the creation of the list).
The worker crash should not break anything.
Dead lock free.
The first requirement must be fulfilled regardless of the number of workers - be it 1 or N.
I.e. you should know (in distributed way) - group of workers have to wait, or you can start the list processing. As soon as we utter the phrase "distributed" and "fault-tolerant". These concepts always lead with they friends:
Atomicity (eg by blocking)
Reservation
In practice
In practice, i am afraid that your system needs to be a little bit more complicated (maybe you just do not have written, and you already have it).
Your method:
Pessimistic locking with mutex via SET NX PX. NX is a guarantee that only one process at a time doing the work (atomicity). The PX ensures that if something happens with this process the lock is released by the Redis (one part of fault-tolerant about dead locking).
All workers try to catch one mutex (per list key), so just one be happy and would process list after X time. This process can update TTL of mutex (if need more time as originally wanted). If process would crash - the mutex would be unlocked after TTL and be grabbed with other worker.
My suggestion
The fault-tolerant reliable queue processing in Redis built around RPOPLPUSH:
RPOPLPUSH item from processing to special list (per worker per list).
Process item
Remove item from special list
Requirements
So, if worker would crashed we always can return broken message from special list to main list. And Redis guarantees atomicity of RPOPLPUSH/RPOP. That is, there is only a problem group of workers to wait a while.
And then two options. First - if have much of clients and lesser workers use locking on side of worker. So try to lock mutex in worker and if success - start processing.
And vice versa. Use SET NX PX each time you execute LPUSH/RPUSH (to have "wait N time before pop from me" solution if you have many workers and some push clients). So push is:
SET myListLock 1 PX 10000 NX
LPUSH myList value
And each worker just check if myListLock exists we should wait not at least key TTL before set processing mutex and start to drain.

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest would be to create and run 2 threads, one that runs the tool and one that loads data. Start 12 seconds timer and trigger both threads. Upon each file load completion check the passed time. If 12 seconds passed, fetch the data into the thread running the tool. Restart loading the data in parallel to processing of previous bulk. Once previous bulk processing completes restart the 12 sec timer and continue checking it upon every file load completion. Repeat till no more data remains.
For better results a more complex solution might be required. You can do some benchmarking to get an evaluation of average data loading time. Since it might be different for small and large files, several evaluations may be needed for different categories of files (according to size). Optimal resources utilization would be the one that processes the data in the same rate the new data arrives. Processing time includes the 12 seconds startup. The benchmarking should give you a ratio of processing threads number vs. reading threads number (you can also decrease/increase the number of active reading threads according to the incoming file sizes). Actually, it's a variation of producer-consumer problem with multiple producers and consumers.