SQL Server 2005 - SSIS statistics per component per run

Coming from a different ETL tool, I'm trying to figure out how to get (production) statistics on each component as it runs in SSIS.
For example, if the flat file is reading from an external source that has a high deviation (the rows/sec changes drastically at different times), I would like to know that information.
If an SSIS package has a significant 'slow point' (a buffer filling up / the data stream being impacted), I would also like to know that information.
CPU time and read IO/write IO (pulled, for example, from the DMVs via sprocs) would also be ideal, and useful for people who want to show improvement from moving from a sproc to SSIS in a consistent, measurable way.
The reason I'm asking this question is I see the rows going through BIDS during debugging, but it may not reflect the actual rows/sec on each component in production.
How would one enable/introspect/obtain these kinds of statistics for production environments? Even if collecting them takes a small performance hit, the numbers are a big deal.
Thanks!
-Darren

This is difficult to do in SSIS 2005. I have seen the runtime engine "just stop" when trying to perform task-level logging from event handlers in complex SSIS packages. One thought: instrument the Data Flows only, by adding Row Count Transformations just after the Source Adapters and on each Data Flow Path that outputs rows. Then add an Execute SQL Task to each Data Flow Task's OnPreExecute event handler to log the start of execution, and another Execute SQL Task to the corresponding OnPostExecute event handler. In the OnPostExecute logic, store the row counts and the end time of the Data Flow Task execution. I believe that will provide enough metrics to calculate throughput for the data flow pipeline.
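As a rough sketch of that logging idea (the table and column names below are assumptions, and the row counts come from package variables populated by the Row Count Transformations):

    -- Hypothetical logging table for per-data-flow metrics.
    CREATE TABLE dbo.DataFlowLog (
        LogId        INT IDENTITY(1,1) PRIMARY KEY,
        PackageName  NVARCHAR(260) NOT NULL,
        TaskName     NVARCHAR(260) NOT NULL,
        StartTime    DATETIME      NOT NULL,
        EndTime      DATETIME      NULL,
        RowsRead     INT           NULL,
        RowsWritten  INT           NULL
    );

    -- OnPreExecute Execute SQL Task: the '?' placeholders map to the
    -- System::PackageName and System::SourceName variables.
    INSERT INTO dbo.DataFlowLog (PackageName, TaskName, StartTime)
    VALUES (?, ?, GETDATE());

    -- OnPostExecute Execute SQL Task: the '?' placeholders map to the Row Count
    -- variables (e.g. User::RowsRead, User::RowsWritten) and System::SourceName.
    -- The MAX(LogId) correlation is a simplification; a real package could pass
    -- the LogId through a variable instead.
    UPDATE dbo.DataFlowLog
    SET EndTime     = GETDATE(),
        RowsRead    = ?,
        RowsWritten = ?
    WHERE LogId = (SELECT MAX(LogId) FROM dbo.DataFlowLog WHERE TaskName = ?);

Throughput for a run is then just RowsRead divided by DATEDIFF(second, StartTime, EndTime).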
Hope this helps,
Andy

Not sure if it will help, but maybe you can try configuring logging on your package and selecting the "SSIS log provider for SQL Server Profiler".
It captures a number of events between the beginning and the end of the data source processing.

Related

Setting timeout for query in Oracle

We have a data warehouse setup where we use Oracle 12c and Informatica for ETL. We call some hourly procedures from an Informatica workflow. Sometimes these procedures take more than an hour for various reasons. Is it possible to set a timeout event, at the database level or at the Informatica level, that will terminate the current execution and generate a mail alert?
Best Regards
Well... no. This and a bunch of other features are not part of Informatica. An external orchestration tool is very helpful here: one that takes care of file watching and triggers workflows upon file arrival, reports when a workflow runs too long or too short, notifies you when a file you expect has not been received, and so on.

Trapping All Batch Jobs from MVS

I'm trying to trap all the batch jobs from MVS.
I want to transmit all the batch job information (start, end, error) to an external system in order to conduct further analysis.
Has anyone got any idea of how to do this?
Write an IEFACTRT exit (or whatever its modern day equivalent is) and have the systems programmers install it.
IBM actually provides a facility for this. You can have it write SMF (System Management Facility) records for all jobs. The record layouts are available and you can write code to do analysis on them or you can get 3rd party products like OmegaMon that will do the analysis and reporting for you.
In my shop, we print the job info into plain files, FTP them down to some file servers, run extract/format scripts from there, and pull the data into a BI platform for later analysis/visualisation.
Currently, we are studying how to use a graph database like Neo4j to understand our batch job relationships more deeply and present them better to the people who are interested. For now, we think a graph database is a very neat tool for this kind of thing (batch job management).
Hope my answer gives you some inspiration/reminders.
Typically, installations cut SMF type 30 records. Subtype 1 is written when a new transaction is started; here, "transaction" means a System Resources Manager (SRM) transaction, not to be confused with a transaction in the context of, say, a database system. A batch job that begins execution is such a transaction. Subtype 5 is written when a transaction ends, and it includes a completion section that reports the job termination status.
Now, SMF processing is traditionally done in batch as you have to prepare the SMF records first either by extracting them from the log stream or from one of the SYS1.MANx data sets.
But recently, capabilities have been added to z/OS that allow you to hook into the process when SMF records are written. A product like the IBM Common Data Provider for z/OS can be used to transform the data in the way you want and to stream it to a destination of choice, for instance Logstash. Following such a technique allows you to process SMF records almost in real time.

SSIS 2008 with Stored Procedures. Transaction best practices

I'm currently using SSIS for some process flow, scripting, and straight data import. Most of the data cleaning and transformation happens within stored procedures that I'm calling from SSIS Execute SQL Tasks. For most of the sprocs, if one fails for any reason, I don't really care about rolling back any transactions. My SSIS error handling essentially wipes out any staging data and then logs the errors to a table. (A human needs to fix the underlying data issue at that point.)
My question revolves around begin tran, end tran. Are there any cases where a stored proc can fail, and then not let the calling SSIS process know? I'm looking for hardware failure, lock timeouts, etc.
I'd prefer to avoid using transactions as much as possible and rely on my SSIS error handling.
Thoughts?
One case I can think of (and transactions won't help there either) would be if the stored proc did not update or insert any records. That would not be a failure, but it might need to be one for an SSIS package. You might want to return how many rows were affected and check that afterwards.
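As a rough sketch of that check (the procedure and table names are assumptions), the proc could hand the affected row count back through an output parameter, which the package or a wrapper batch then validates:

    -- Hypothetical cleaning proc that reports how many rows it touched.
    CREATE PROCEDURE dbo.usp_CleanStaging
        @RowsAffected INT OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;

        UPDATE dbo.Staging_Clean
        SET CustomerName = LTRIM(RTRIM(CustomerName));

        SET @RowsAffected = @@ROWCOUNT;
    END;
    GO

    -- Caller (e.g. an Execute SQL Task) fails the step if nothing was processed.
    DECLARE @Rows INT;
    EXEC dbo.usp_CleanStaging @RowsAffected = @Rows OUTPUT;
    IF @Rows = 0
        RAISERROR('usp_CleanStaging affected 0 rows - failing the package step.', 16, 1);

Raising an error with severity 16 is enough to make the calling Execute SQL Task fail, so the package's error handling takes over from there.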
We also do this for some imports where a number significantly off from the last import indicates a data problem. So if we usually get 100,000 records from client A in import B and we get 5,000 instead, the SSIS package fails until a human can look at it and see if the file is bad or if they genuinely did mean to reduce their work force or customer list.
Incidentally, we stage to two tables (one with the raw unchanged data and one we use for cleaning). A failure of the SSIS package should not roll those back if you want to easily see what the data issue was. You can then tell whether the data was wrong from the start or whether it somehow got lost or fixed incorrectly in the cleaning process. Sometimes the place where the error got logged is not the place where the error actually occurred, and it is nice to see what the data looked like both unchanged and after the change process. Sometimes you have bad data, yes (OK, the majority of the time), but sometimes you have a bug. Having both of those tables lets you quickly see which of the two it is.
You could have all your procs insert into a logging table as the last step, and make sure that record is there before executing the next step, if you are concerned that some executions fail without bubbling an error back to the package.
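A minimal sketch of that last-step logging pattern, with assumed object names:

    -- Hypothetical completion log written by each proc as its final statement.
    CREATE TABLE dbo.ProcCompletionLog (
        ProcName    SYSNAME  NOT NULL,
        CompletedAt DATETIME NOT NULL DEFAULT GETDATE()
    );

    -- Last statement inside each stored procedure:
    INSERT INTO dbo.ProcCompletionLog (ProcName)
    VALUES (OBJECT_NAME(@@PROCID));

    -- Check run by the package before it moves on to the next step:
    IF NOT EXISTS (
        SELECT 1
        FROM dbo.ProcCompletionLog
        WHERE ProcName = N'usp_CleanStaging'         -- hypothetical proc name
          AND CompletedAt >= CAST(GETDATE() AS DATE) -- completed during today's run
    )
        RAISERROR('usp_CleanStaging did not log completion.', 16, 1);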

SSIS - Connection Management Within a Loop

I have the following SSIS package:
(Screenshot of the SSIS package: http://www.freeimagehosting.net/uploads/5161bb571d.jpg)
The problem is that within the Foreach loop a connection is opened and closed for each iteration.
On running SQL Profiler I see a series of:
Audit Login
RPC:Completed
Audit Logout
The duration for the login and the RPC that actually does the work is minimal. However, the duration for the logout is significant, running into several seconds each. This causes the JOB to run very slowly - taking many hours. I get the same problem when running either on a test server or stand-alone laptop.
Could anyone please suggest how I may change the package to improve performance?
Also, I have noticed that when running the package from Visual Studio, it looks as though it continues to run (the component blocks go amber and then green) even though all the processing has been completed and SQL Profiler has gone silent.
Thanks,
Rob.
Have you tried running your data flow tasks in parallel rather than serially? You can most likely break up your for loops so that each 'set' runs in parallel; while it might still be expensive to log in and out, you will be doing it N times simultaneously.
SQL Server is most performant when running a batch of operations in a single query. Is it possible to redesign your package so that it batches updates in a single call, rather than having a procedural workflow with for-loops, as you have it here?
If the design of your application and the RPC permits (or can be refactored to permit it), this might be the best solution for performance.
For example, instead of something like:
    for each Facility
        for each Stock
            update Qty
See if you can create a structure (using SQL, or a bulk update RPC with a single connection) like:
update Qty
from Qty join Stock join Facility
...
If you control the implementation of the RPC, the RPC could maintain the same API (if needed) by delegating to another which does the batch operation, but specifies a single-record restriction (where record=someRecord).
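For illustration, a hedged sketch of what that set-based form might look like (Facility, Stock, and Qty come from the pseudocode above; the join keys and the quantity calculation are assumptions):

    -- One set-based statement replaces the nested per-facility / per-stock loop,
    -- so all the work happens in a single call on a single connection.
    UPDATE q
    SET q.Quantity = q.Quantity + s.ReceivedQuantity      -- assumed calculation
    FROM dbo.Qty      AS q
    JOIN dbo.Stock    AS s ON s.StockId    = q.StockId
    JOIN dbo.Facility AS f ON f.FacilityId = s.FacilityId;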
Have you tried doing the following?
In the Connection Managers pane, right-click the connection that is used within the loop and choose Properties. In the properties for the connection, find "RetainSameConnection" and change it from the default of False to True. This will let your package maintain the connection throughout the package run. Your profiler trace would then probably look like:
Audit Login
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
RPC:Completed
...
Audit Logout
With the final Audit Logout happening at the end of package execution.

Spawning multiple SQL tasks in SQL Server 2005

I have a number of stored procs which I would like to run simultaneously on the server, ideally entirely on the server without relying on connections from an external client.
What options are there to launch all these and have them run simultaneously (I don't even need to wait until all the processes are done to do additional work)?
I have thought of:
Launching multiple connections from a client, having each start the appropriate SP.
Setting up jobs for each SP and starting the jobs from a SQL Server connection or SP (see the sketch after this list).
Using xp_cmdshell to start additional runs, equivalent to osql or whatever.
SSIS - I need to see if the package can be dynamically written to handle more SPs, because I'm not sure how much access my clients are going to get to production.
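As a small sketch of the Agent-job option (the job names here are hypothetical): msdb.dbo.sp_start_job returns as soon as the job is queued, so calling it several times in a row effectively launches the procs in parallel.

    -- Each job wraps one stored proc; sp_start_job is asynchronous, so these
    -- three Layer 1 loads run concurrently under SQL Server Agent.
    EXEC msdb.dbo.sp_start_job @job_name = N'Load_Layer1_Fact_Sales';
    EXEC msdb.dbo.sp_start_job @job_name = N'Load_Layer1_Fact_Inventory';
    EXEC msdb.dbo.sp_start_job @job_name = N'Load_Layer1_Fact_Shipments';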
In the job and cmdshell cases, I'm probably going to run into permissions level problems from the DBA...
SSIS could be a good option - if I can table-drive the SP list.
This is a data warehouse situation, and the work is largely independent; NOLOCK is universally used on the stars. The system is an 8-way, 32 GB machine, so I'm going to load it down and scale back if I see problems.
I basically have three layers. Layer 1 has a small number of processes and depends on basically all the facts/dimensions already being loaded (effectively, the stars are a Layer 0 - and yes, unfortunately they will all need to be loaded). Layer 2 has a number of processes which depend on some or all of Layer 1, and Layer 3 has a number of processes which depend on some or all of Layer 2. I have the dependencies in a table already, and would initially only launch all the procs in a particular layer at the same time, since they are orthogonal within a layer.
Is SSIS an option for you? You can create a simple package with parallel Execute SQL tasks to execute the stored procs simultaneously. However, depending on what your stored procs do, you may or may not get benefit from starting this in parallel (e.g. if they all access the same table records, one may have to wait for locks to be released etc.)
At one point I did some architectural work on a product known as Acumen Advantage that has a warehouse manager that does this.
The basic strategy for this is to have a control DB with a list of the sprocs and their dependencies. Based on the dependencies you can do a Topological Sort to give them an order to run in. If you do this, you need to manage the dependencies - all of the predecessors of a stored procedure must complete before it executes. Just starting the sprocs in order on multiple threads will not accomplish this by itself.
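A hedged sketch of such a control structure (all table and column names here are assumptions): a dependency table, plus a query that returns the sprocs whose predecessors have all completed and which are therefore ready to start.

    -- Control tables: one row per sproc, one row per dependency edge.
    CREATE TABLE dbo.EtlProc (
        ProcName    SYSNAME PRIMARY KEY,
        CompletedAt DATETIME NULL          -- set when the proc finishes
    );

    CREATE TABLE dbo.EtlProcDependency (
        ProcName      SYSNAME NOT NULL REFERENCES dbo.EtlProc (ProcName),
        DependsOnProc SYSNAME NOT NULL REFERENCES dbo.EtlProc (ProcName),
        PRIMARY KEY (ProcName, DependsOnProc)
    );

    -- Procs that have not run yet and whose predecessors have all completed.
    SELECT p.ProcName
    FROM dbo.EtlProc AS p
    WHERE p.CompletedAt IS NULL
      AND NOT EXISTS (
            SELECT 1
            FROM dbo.EtlProcDependency AS d
            JOIN dbo.EtlProc AS pre ON pre.ProcName = d.DependsOnProc
            WHERE d.ProcName = p.ProcName
              AND pre.CompletedAt IS NULL
          );

A scheduler can poll this query, launch whatever it returns, stamp each proc's CompletedAt when it finishes, and repeat until nothing is left - the runtime equivalent of walking the topological order.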
Implementing this meant knocking much of the SSIS functionality on the head and implementing another scheduler. This is OK for a product but probably overkill for a bespoke system. A simpler approach is as follows:
You can manage the dependencies at a more coarse-grained level by organising the ETL vertically by dimension (sometimes known as Subject Oriented ETL) where a single SSIS package and set of sprocs takes the data from extraction through to producing dimensions or fact tables. Typically the dimensions will mostly be siloed, so they will have minimal interdependency. Where there is interdependency, make one dimension (or fact table) load process dependent on whatever it needs upstream.
Each loader becomes relatively modular and you still get a useful degree of parallelism by kicking off the load processes in parallel and letting the SSIS scheduler work it out. The dependencies will contain some redundancy. For example an ODS table may not be dependent on a dimension load being completed but the upstream package itself takes the components right through to the dimensional schema before it completes. However this is not likely to be an issue in practice for the following reasons:
The load process probably has plenty of other tasks that can execute in the meantime
The most resource-hungry tasks will almost certainly be the fact table loads, which will mostly not be dependent on each other. Where there is a dependency (e.g. a rollup table based on the contents of another table) this cannot be avoided anyway.
You can construct the SSIS packages so they pick up all of their configuration from an XML file whose location is supplied externally in an environment variable. This sort of thing can be fairly easily implemented with scheduling systems like Control-M.
This means that a modified SSIS package can be deployed with relatively little manual intervention. The production staff can be handed the packages to deploy along with the stored procedures, and can maintain the config files on a per-environment basis without having to manually fiddle with configuration inside the SSIS packages.
You might want to look at Service Broker and its activation stored procedures... it might be an option.
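A hedged sketch of that idea, with all object names assumed: work requests are sent as messages, and the queue's activation procedure (which Service Broker can run on several readers at once) executes each requested proc inside the server, with no external client connection involved.

    -- Requires Service Broker to be enabled on the database
    -- (ALTER DATABASE ... SET ENABLE_BROKER).
    CREATE MESSAGE TYPE RunProcRequest VALIDATION = NONE;
    CREATE CONTRACT RunProcContract (RunProcRequest SENT BY INITIATOR);
    CREATE QUEUE dbo.RunProcQueue;
    CREATE SERVICE RunProcService ON QUEUE dbo.RunProcQueue (RunProcContract);
    GO

    -- Activation proc: pulls one request at a time and runs the proc call it contains.
    CREATE PROCEDURE dbo.usp_ActivateRunProc
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @handle UNIQUEIDENTIFIER, @body NVARCHAR(MAX), @type SYSNAME;
        WHILE 1 = 1
        BEGIN
            WAITFOR (
                RECEIVE TOP (1)
                    @handle = conversation_handle,
                    @body   = CAST(message_body AS NVARCHAR(MAX)),
                    @type   = message_type_name
                FROM dbo.RunProcQueue
            ), TIMEOUT 5000;
            IF @@ROWCOUNT = 0 BREAK;
            IF @type = N'RunProcRequest'
                EXEC sp_executesql @body;   -- body is e.g. N'EXEC dbo.Load_Fact_Sales;'
            END CONVERSATION @handle;
        END
    END;
    GO

    -- Turn on activation with several parallel readers.
    ALTER QUEUE dbo.RunProcQueue
        WITH ACTIVATION (
            STATUS = ON,
            PROCEDURE_NAME = dbo.usp_ActivateRunProc,
            MAX_QUEUE_READERS = 4,
            EXECUTE AS OWNER
        );
    GO

    -- Enqueueing one proc execution (repeat per proc in the current layer):
    DECLARE @dialog UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @dialog
        FROM SERVICE RunProcService
        TO SERVICE 'RunProcService'
        ON CONTRACT RunProcContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @dialog
        MESSAGE TYPE RunProcRequest (N'EXEC dbo.Load_Fact_Sales;');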
In the end, I created a C# management console program which launches the processes asynchronously as they become eligible to run and keeps track of the connections.