SSIS - Connection Management Within a Loop

SSIS - Connection Management Within a Loop - optimization

I have the following SSIS package:
alt text http://www.freeimagehosting.net/uploads/5161bb571d.jpg
The problem is that within the Foreach loop a connection is opened and closed for each iteration.
On running SQL Profiler I see a series of:
Audit Login
RPC:Completed
Audit Logout
The duration for the login and the RPC that actually does the work is minimal. However, the duration for the logout is significant, running into several seconds each. This causes the JOB to run very slowly - taking many hours. I get the same problem when running either on a test server or stand-alone laptop.
Could anyone please suggest how I may change the package to improve performance?
Also, I have noticed that when running the package from Visual Studio, it looks as though it continues to run with the component blocks going amber then green but actually all the processing has been completed and SQL profiler has dropped silent?
Thanks,
Rob.

Have you tried running your data flow task in parallel vs serial? You can most likely break up your for loops to enable you to run each 'set' in parallel, so while it might still be expensive to login/out, you will be doing it N times simultaneously.

SQL Server is most performant when running a batch of operations in a single query. Is it possible to redesign your package so that it batches updates in a single call, rather than having a procedural workflow with for-loops, as you have it here?
If the design of your application and the RPC permits (or can be refactored to permit it), this might be the best solution for performance.
For example, instead of something like:
for each Facility
for each Stock
update Qty
See if you can create a structure (using SQL, or a bulk update RPC with a single connection) like:
update Qty
from Qty join Stock join Facility
...
If you control the implementation of the RPC, the RPC could maintain the same API (if needed) by delegating to another which does the batch operation, but specifies a single-record restriction (where record=someRecord).

Have you tried doing the following?
In your connection managers for the connection that is used within the loop, right click and choose properties. In the properties for the connection, find "RetainSameConnection" and change it to True from the default of False. This will let your package maintain the connection throughout your package run. Your profiler would then probably look like:
Audit Login
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
...
Audit Logout
With the final Audit Logout happening at the end of package execution.

Related

Sql Server 2005 - SSIS statistics per component per run

Coming from a different ETL tool, I'm trying to figure out how to get (production) statistics on each component as it runs in SSIS.
For example, if the flat file is reading from an external source that has a high deviation (the rows/sec changes drastically at different times), I would like to know that information.
If an SSIS has a significant 'slow point' (buffer filling up / data stream impacted), I would also like to know that information.
And using sprocs for example from the DMV's, the CPU time and readIO/writeIO would also be ideal (and useful for people showing improvement by moving from sproc to SSIS in a consistent/measurable approach).
The reason I'm asking this question is I see the rows going through BIDS during debugging, but it may not reflect the actual rows/sec on each component in production.
How would one either enable/introspect/obtain these kinds of statistics for production environments (even if it takes a small hit, the numbers are a big deal).
Thanks!
-Darren

This is difficult to do in SSIS 2005. I have seen the runtime engine "just stop" when trying to perform task-level logging from event handlers in complex SSIS packages. One thought: to instrument the Data Flows only by adding Row Count Transformations just after Source Adapters and on each Data Flow Path that outputs rows. Then add an Execute SQL Task to each Data Flow Task's OnPreExecute event handler to log the start of execution, and add another Execute SQL Task to the corresponding OnPostExecute event handler. In the onPostExecute logic, store the row counts and the end time of the data flow task execution. I believe that will provide enough metrics to calculate throughput for the data flow pipeline.
Hope this helps,
Andy

Not sure if it will help, but maybe you can try to configure logging on your package and select "SSIS log provider for SQL Server Profiler"
It shows several information between begin and end of the DataSource Processing

Run a SQL command in the event my connection is broken? (SQL Server)

Here's the sequence of events my hypothetical program makes...
Open a connection to server.
Run an UPDATE command.
Go off and do something that might take a significant amount of time.
Run another UPDATE that reverses the change in step 2.
Close connection.
But oh-no! During step 3, the machine running this program literally exploded. Other machines querying the same database will now think that the exploded machine is still working and doing something.
What I'd like to do, is just as the connection is opened, but before any changes have been made, tell the server that should this connection close for whatever reason, to run some SQL. That way, I can be sure that if something goes wrong, the closing update will run.
(To pre-empt the answer, I'm not looking for table/record locks or transactions. I'm not doing resource claims here.)
Many thanks, billpg.

I'm not sure there's anything built in, so I think you'll have to do some bespoke stuff...
This is totally hypothetical and straight off the top of my head, but:
Take the SPID of the connection you
opened and store it in some temp
table, with the text of the reversal
update.
Use an a background process (either
SSIS or something else) to monitor
the temp table and check that the
SPID is still present as an open connection.
If the connection dies then the background process can execute the stored revert command
If the connection completes properly then the SPID can be removed from the temp table so that the background process no longer reverts it when the connection closes.
Comments or improvements welcome!

I'll expand on my comment. In general, I think you should reconsider your approach. All database access code should open a connection, execute a query then close the connection where you rely on connection pooling to mitigate the expense of opening lots of database connections.
If it is the case that we are talking about a single SQL command whose rows on which it operates should not change, that is a problem that should be handled by the transaction isolation level. For that you might investigate the Snapshot isolation level in SQL Server 2005+.
If we are talking about a series of queries that are part of a long running transaction, that is more complicated and can be handled via storage of a transaction state which other connections read in order to determine whether they can proceed. Going down this road, you need to provide users with tools where they can cancel a long running transaction that might no longer be applicable.

Assuming it's even possible... this will only help you if the client machine explodes during the transaction. Also, there's a risk of false positives - the connection might get dropped for a few seconds due to network noise.
The approach that I'd take is to start a process on another machine that periodically pings the first one to check if it's still on-line, then takes action if it becomes unreachable.

Can Sql Server 2008 Stored Procedures (or Triggers) manually parallel or background some logic?

If i have a stored procedure or a trigger in Sql Server 2008, can it do some sql calculations 'in another non-blocking thread'? ie. something in the background
also, can two sql code blocks be ran in parallel? or two stored procs be ran in parallel?
for example. Imagine we are given the job calculating the scores for each Stack Overflow user (and please leave all 'do that elsehwere/service/batch/overnight/etc, elswhere) after a user does some 'action'.
so we have a trigger on the Post table, so when a new post is INSERTED, the trigger fires off and part of that logic, it calculates the user's latest score. Instead of waiting for the stored proc to finish and block the current sql thread / executire, can we ask it to calc the score in the background OR parallel.
cheers!

SQL Server does not have parallel or deferred execution: each block of running code in a connection is serial, one line after the other.
To decouple processing, you usually have to use SQL Server Agent jobs or use Service broker. These start executing in a new connection, new session etc
This makes sense:
What if you want to rollback your changes? What does the background thread do and how does it know?
What data does it use? New, Old, lock wait, snapshot?
What if it gets ahead of the main thread and uses stale data?

No, but you could write the request to a queue. Service Broker, a SQL Server component, provides support for this kind of thing. It's probably the best option available for asynchronous processing.

Database Job Scheduling

I have a procedure written in PLJava that sends out updates over JMS in my postgres database.
What I would like to do is have that function called on an interval (every 15 seconds) internally in the database (preferably not from an outside process). Is this possible? Any ideas?

If you need no external access, you are presumably able to modify the database design so that you don't need the update at all. Can you explain more about what the update is doing?
As depesz said, you could use either cron or pgAgent, but they are only able to go down to a one minute granularity, not 15 seconds. Considering sleeping inside the stored procedure until the next iteration is not a good idea, because you will have an open transaction for all that time which is a really bad idea.

Strict answer: it is not possible. Since you don't want outside process, and PostgreSQL doesn't support jobs - you are out of luck.
If you'll reconsider using outside processes, then you're most likely want something like cron, or better yet pgagent.
On absolutely other hand - what do you need to do that has to happen every 30 seconds? this seems like a problem with design.

First, you'll spend the least amount of effort if you just go with a cron job.
However, if you were starting from scracth: You are trying to periodically replicate rows from your database. I think you are looking at a replication queue.
The PGQ project (used for Londiste replication, both from Skype's SkyTools) has a queue that you can use independently. When configuring it, you set a maximum event count, and a loop delay, before batched events are generated. You can get batches spaced by no more than 15 seconds that way. You now have to produce the events that will be batched, using a trigger that calls pgq.insert_event; and consume the queues. The consumer can call your PL/Java stored proc; you'll have to rewrite the procedure to send everything in the batch instead of scanning the base table for new events.

As far as I know postgresql doesn't support scheduled tasks. You'll need to use a script with cron or at (depending on your operating system.)

Sounds like you're doing sort of replication? Every 15s sounds like a lot of updates. Could you setup a trigger (or a number of triggers) instead of polling?

If you are using JMS why not just have th task wait for input on the queue?

Per your depesz comment, you have a PL/Java stored procedure that "flushes out database tables (updates) as java objects". Since you want it to run in 15 second intervals, it must be processing a batch of updates each time. Rather than processing a batch of updates in a stored procedure every 15 seconds, why not process them one at a time when they happen via an after update trigger and eliminate the need for a timed interval. If you are aggregrating data from multiple tables to build your objects than add the triggers to you upper most tables only.

In my case the problem was that agent couldn't authorize to database so after I've made all connections trusted from localhost the service started successfully and job works fine
for more information about error you should see into windows event viewer or eq in unix based system. see my config file C:\Program Files\PostgreSQL\10\data\pg_hba.conf

Start stored procedures sequentially or in parallel

We have a stored procedure that runs nightly that in turn kicks off a number of other procedures. Some of those procedures could logically be run in parallel with some of the others.
How can I indicate to SQL Server whether a procedure should be run in parallel or serial — ie: kicked off of asynchronously or blocking?
What would be the implications of running them in parallel, keeping in mind that I've already determined that the processes won't be competing for table access or locks- just total disk io and memory. For the most part they don't even use the same tables.
Does it matter if some of those procedures are the same procedure, just with different parameters?
If I start a pair or procedures asynchronously, is there a good system in SQL Server to then wait for both of them to finish, or do I need to have each of them set a flag somewhere and check and poll the flag periodically using WAITFOR DELAY?
At the moment we're still on SQL Server 2000.
As a side note, this matters because the main procedure is kicked off in response to the completion of a data dump into the server from a mainframe system. The mainframe dump takes all but about 2 hours each night, and we have no control over it. As a result, we're constantly trying to find ways to reduce processing times.

I had to research this recently, so found this old question that was begging for a more complete answer. Just to be totally explicit: TSQL does not (by itself) have the ability to launch other TSQL operations asynchronously.
That doesn't mean you don't still have a lot of options (some of them mentioned in other answers):
Custom application: Write a simple custom app in the language of your choice, using asynchronous methods. Call a SQL stored proc on each application thread.
SQL Agent jobs: Create multiple SQL jobs, and start them asynchronously from your proc using sp_start_job. You can check to see if they have finished yet using the undocumented function xp_sqlagent_enum_jobs as described in this excellent article by Gregory A. Larsen. (Or have the jobs themselves update your own JOB_PROGRESS table as Chris suggests.) You would literally have to create separate job for each parallel process you anticipate running, even if they are running the same stored proc with different parameters.
OLE Automation: Use sp_oacreate and sp_oamethod to launch a new process calling the other stored proc as described in this article, also by Gregory A. Larsen.
DTS Package: Create a DTS or SSIS package with a simple branching task flow. DTS will launch tasks in individual spids.
Service Broker: If you are on SQL2005+, look into using Service Broker
CLR Parallel Execution: Use the CLR commands Parallel_AddSql and Parallel_Execute as described in this article by Alan Kaplan (SQL2005+ only).
Scheduled Windows Tasks: Listed for completeness, but I'm not a fan of this option.
I don't have much experience with Service Broker or CLR, so I can't comment on those options. If it were me, I'd probably use multiple Jobs in simpler scenarios, and a DTS/SSIS package in more complex scenarios.
One final comment: SQL already attempts to parallelize individual operations whenever it can*. This means that running 2 tasks at the same time instead of after each other is no guarantee that it will finish sooner. Test carefully to see whether it actually improves anything or not.
We had a developer that created a DTS package to run 8 tasks at the same time. Unfortunately, it was only a 4-CPU server :)
*Assuming default settings. This can be modified by altering the server's Maximum Degree of Parallelism or Affinity Mask, or by using the MAXDOP query hint.

Create a couple of SQL Server agent jobs where each one runs a particular proc.
Then from within your master proc kick off the jobs.
The only way of waiting that I can think of is if you have a status table that each proc updates when it's finished.
Then yet another job could poll that table for total completion and kick off a final proc. Alternatively, you could have a trigger on this table.
The memory implications are completely up to your environment..
UPDATE:
If you have access to the task system.. then you could take the same approach. Just have windows execute multiple tasks, each responsible for one proc. Then use a trigger on the status table to kick off something when all of the tasks have completed.
UPDATE2:
Also, if you're willing to create a new app, you could house all of the logic in a single exe...

You do need to move your overnight sprocs to jobs. SQL Server job control will let you do all of the scheduling you are asking for.

You might want to look into using DTS (which can be run from the SQL Agent as a job). It will allow you pretty fine control over which stored procedures need to wait for others to finish and what can run in parallel. You can also run the DTS package as an EXE from your own scheduling software if needed.
NOTE: You will need to create multiple copies of your connection objects to allow calls to run in parallel. Two calls using the same connection object will still block each other even if you don't explicitly put in a dependency.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SSIS - Connection Management Within a Loop - optimization

Have you tried running your data flow task in parallel vs serial? You can most likely break up your for loops to enable you to run each 'set' in parallel, so while it might still be expensive to login/out, you will be doing it N times simultaneously.

Related

Sql Server 2005 - SSIS statistics per component per run

Run a SQL command in the event my connection is broken? (SQL Server)

Can Sql Server 2008 Stored Procedures (or Triggers) manually parallel or background some logic?

Database Job Scheduling

Start stored procedures sequentially or in parallel

Categories

Resources