Spawning multiple SQL tasks in SQL Server 2005 - sql

I have a number of stored procs which I would like to all run simultaneously on the server. Ideally all on the server without reliance on connections to an external client.
What options are there to launch all these and have them run simultaneously (I don't even need to wait until all the processes are done to do additional work)?
I have thought of:
Launching multiple connections from
a client, having each start the
appropriate SP.
Setting up jobs for
each SP and starting the jobs from a
SQL Server connection or SP.
Using
xp_cmdshell to start additional runs
equivalent to osql or whetever
SSIS - I need to see if the package can be dynamically written to handle more SPs, because I'm not sure how much access my clients are going to get to production
In the job and cmdshell cases, I'm probably going to run into permissions level problems from the DBA...
SSIS could be a good option - if I can table-drive the SP list.
This is a datawarehouse situation, and the work is largely independent and NOLOCK is universally used on the stars. The system is an 8-way 32GB machine, so I'm going to load it down and scale it back if I see problems.
I basically have three layers, Layer 1 has a small number of processes and depends on basically all the facts/dimensions already being loaded (effective, the stars are a Layer 0 - and yes, unfortunately they will all need to be loaded), Layer 2 has a number of processes which depend on some or all of Layer 1, and Layer 3 has a number of processes which depend on some or all of Layer 2. I have the dependencies in a table already, and would only initially launch all the procs in a particular layer at the same time, since they are orthogonal within a layer.

Is SSIS an option for you? You can create a simple package with parallel Execute SQL tasks to execute the stored procs simultaneously. However, depending on what your stored procs do, you may or may not get benefit from starting this in parallel (e.g. if they all access the same table records, one may have to wait for locks to be released etc.)

At one point I did some architectural work on a product known as Acumen Advantage that has a warehouse manager that does this.
The basic strategy for this is to have a control DB with a list of the sprocs and their dependencies. Based on the dependencies you can do a Topological Sort to give them an order to run in. If you do this, you need to manage the dependencies - all of the predecessors of a stored procedure must complete before it executes. Just starting the sprocs in order on multiple threads will not accomplish this by itself.
Implementing this meant knocking much of the SSIS functionality on the head and implementing another scheduler. This is OK for a product but probably overkill for a bespoke system. A simpler solution is thus:
You can manage the dependencies at a more coarse-grained level by organising the ETL vertically by dimension (sometimes known as Subject Oriented ETL) where a single SSIS package and set of sprocs takes the data from extraction through to producing dimensions or fact tables. Typically the dimensions will mostly be siloed, so they will have minimal interdependency. Where there is interdependency, make one dimension (or fact table) load process dependent on whatever it needs upstream.
Each loader becomes relatively modular and you still get a useful degree of parallelism by kicking off the load processes in parallel and letting the SSIS scheduler work it out. The dependencies will contain some redundancy. For example an ODS table may not be dependent on a dimension load being completed but the upstream package itself takes the components right through to the dimensional schema before it completes. However this is not likely to be an issue in practice for the following reasons:
The load process probably has plenty of other tasks that can execute in the meantime
The most resource-hungry tasks will almost certainly be the fact table loads, which will mostly not be dependent on each other. Where there is a dependency (e.g. a rollup table based on the contents of another table) this cannot be avoided anyway.
You can construct the SSIS packages so they pick up all of their configuration from an XML file and the location can be supplied exernally in an environment variable. This sort of thing can be fairly easily implemented with scheduling systems like Control-M.
This means that a modified SSIS package can be deployed with relatively little manual intervention. The production staff can be handed the packages to deploy along with the stored procedures and can mainain the config files on a per-environment basis without having to manually fiddle configuration in the SSIS packages.

you might want to look at the service broker and it's activation stored procedures... might be an option...

In the end, I created a C# management console program which launches the processes Async as they are able to be run and keeps track of the connections.

Related

SQL Server sp_clean_db_free_space

I just want to ask if I can use the stored procedure sp_clean_db_free_space as part of preventive maintenance?
What are the pros and cons of this built in stored procedure? I just want to hear the inputs of someone who already uses this.
Thank you
You would typically not need to explicitly run this, unless you have specific reasons.
Quoting from SQL Server Books Online's page for sp_clean_db_free_space:
Delete operations from a table or update operations that cause a row
to move can immediately free up space on a page by removing references
to the row. However, under certain circumstances, the row can
physically remain on the data page as a ghost record. Ghost records
are periodically removed by a background process. This residual data
is not returned by the Database Engine in response to queries.
However, in environments in which the physical security of the data or
backup files is at risk, you can use sp_clean_db_free_space to clean
these ghost records.
Notice that SQL Server already has a background process to achieve the same result as sp_clean_db_free_space.
The only reason you might wish to explicitly run sp_clean_db_free_space is if there is a risk that the underlying database files or backups can be compromised and analysed. In such cases, any data that has not yet been swept up by the background process can be exposed. Of course, if your system has been compromised in such a way, you probably also have bigger problems on your hands!
Another reason might be that you have a time-bound requirement that deleted data should not be retained in any readable form. If you have only a general requirement that is not time-bound, then it would be acceptable to wait for the regular background process to perform this automatically.
The same page also mentions:
Because running sp_clean_db_free_space can significantly affect I/O
activity, we recommend that you run this procedure outside usual
operation hours.
Before you run sp_clean_db_free_space, we recommend
that you create a full database backup.
To summarize:
You'd use the sp_clean_db_free_space stored procedure only if you have a time-bound requirement that deleted data should not be retained in any readable form.
Running sp_clean_db_free_space is IO intensive.
Microsoft recommend a full database backup prior to this, which has its own IO and space requirements.
Take a look at this related question on dba.stackexchange.com: https://dba.stackexchange.com/questions/11280/how-can-i-truly-delete-data-from-a-sql-server-table-still-shows-up-in-notepad/11281

Why cleanup a DB after a test run?

I have several test suites that read and write data from a dedicated database when they are run. My strategy is to assume that the DB is in an unreliable state before a test is run and if I need certain records in certain tables or an empty table I do that setup before the test is run.
My attitude is to not cleanup the DB at the end of each test suite because each test suite should do a cleanup and setup before it runs. Also, if I'm trying to "visually" debug a test suite it helps that the final state of the DB persists after the tests have completed.
Is there a compelling reason to cleanup a DB after your tests have run?
Depends on your tests, what happens after your tests, and how many people are doing testing.
If you're just testing locally, then no, cleaning up after yourself isn't as important ~so long as~ you're consistently employing this philosophy AND you have a process in place to make sure the database is in a known-good state before doing something other than testing.
If you're part of a team, then yes, leaving your test junk behind can screw up other people/processes, and you should clean up after yourself.
In addition to the previous answer I'd like to also mention that this is more suitable when executing Integration tests. Since Integrated modules work together and in conjunction with infrastructure such as message queues and databases + each independent part works correctly with the services it depends on.
This
cleanup a DB after a test run
helps you to Isolate Test Data. A best practice here is to use transactions for database-dependent tests (e.g.,component tests) and roll back the transaction when done. Use a small subset of data to effectively test behavior. Consider it as Database Sandbox – using the Isolate Test Data pattern. E.g. each developer can use this lightweight DML to populate his local database sandboxes to expedite test execution.
Another advantage is that you Decouple your Database, so ensure that application is backward and forward compatible with your database so you can deploy each independently. Patterns like Encapsulate Table with View, and NoSQL databases ensure that you can deploy two application versions at once without either one of them throwing database-related errors. It was particularly successful in a project where it was imperative to access the database using stored procedures.
All this is actually one of the concepts that is used in Virtual test labs.
In addition to above answers, I'll add few more points:
DB shouldn't be cleaned after test because thats where you've your test data, test results and all history which can be referred later on.
DB should be cleaned only if you are changing some application setting to run your / any specific test, so that it shouldn't impact other tester.

How to continuously delivery SQL-based app?

I'm looking to apply continuous delivery concepts to web app we are building, and wondering if there any solution to protecting the database from accidental erroneous commit. For example, a bug that erases whole table instead of a single record.
How this issue impact can be limited according to continuous delivery doctorine, where the application deployed gradually over segments of infrastructure?
Any ideas?
Well first you cannot tell just from looking what is a bad SQL statement. You might have wanted to delete the entire contents of the table. Therefore is is not physiucally possible to have an automated tool that detects intent.
So to protect your database, first make sure you are in full recovery (not simple) mode and have full backups nightly and transaction log backups every 15 minutes or so. Now you cannot lose much information no matter how badly the process breaks. Your dbas should be trained to be able to recover to a point in time. If you don't have any dbas, I'd suggest the best thing you can do to protect your data is hire some. This is a non-negotiable in any non-trivial database environment and it is terribly risky not to have trained, experienced dbas if your data is critical to the business.
Next, you need to treat SQL like any other code, it should be in source control in scripts. If you are terribly concerned about accidental deletions, then write the scripts for deletes to copy all deletes to a staging table and delete the content of the staging table once a week or so. Enforce this convention in the code reviews. Or better yet set up an auditing process that runs through triggers. Once all records are audited, it is much easier to get back the 150 accidental deletions without having to restore a database. I would never consider having any enterprise application without auditing.
All SQL scripts without exception should be code-reviewed just like other code. All SQL scripts should be tested on QA and passed before moving to porduction. This will greatly reduce the possiblility for error. No developer should have write rights to production, only dbas should have that. Therefore each script should be written so that is can just be run, not run one chunk at a time where you could accidentally forget to highlight the where clause. Train your developers to use transactions correctly in the scripts as well.
Your concern is bad data happening to the database. The solution is to use full logging of all transactions so you can back out of transactions that you want to. This would usually be used in a context of full backups/incremental backups/full logging.
SQL Server, for instance, allows you to restore to a point in time (http://msdn.microsoft.com/en-us/library/ms190982(v=sql.105).aspx), assuming you have full logging.
If you are creating and dropping tables, this could be an expensive solution, in terms of space needed for the log. However, it might meet your needs for development.
You may find that full-logging is too expensive for such an application. In that case, you might want to make periodic backups (daily? hourly?) and just keep these around. For this purpose, I've found LightSpeed to be a good product for fast and efficient backups.
One of the strategies that is commonly adopted is to log the incremental sql statements rather than a collective schema generation so you can control the change at a much granular levels:
ex:
change 1:
UP:
Add column
DOWN:
Remove column
change 2:
UP:
Add trigger
DOWN:
Remove trigger
Once the changes are incrementally captured like this, you can have a simple but efficient script to upgrade (UP) from any version to any version without having to worry about the changes that happening. When the change # are linked to build, it becomes even more effective. When you deploy a build the database is also automatically upgraded(UP) or downgraded(DOWN) to that specific build.
We have an pipeline app which does that at CloudMunch.

Sql Server 2005 - SSIS statistics per component per run

Coming from a different ETL tool, I'm trying to figure out how to get (production) statistics on each component as it runs in SSIS.
For example, if the flat file is reading from an external source that has a high deviation (the rows/sec changes drastically at different times), I would like to know that information.
If an SSIS has a significant 'slow point' (buffer filling up / data stream impacted), I would also like to know that information.
And using sprocs for example from the DMV's, the CPU time and readIO/writeIO would also be ideal (and useful for people showing improvement by moving from sproc to SSIS in a consistent/measurable approach).
The reason I'm asking this question is I see the rows going through BIDS during debugging, but it may not reflect the actual rows/sec on each component in production.
How would one either enable/introspect/obtain these kinds of statistics for production environments (even if it takes a small hit, the numbers are a big deal).
Thanks!
-Darren
This is difficult to do in SSIS 2005. I have seen the runtime engine "just stop" when trying to perform task-level logging from event handlers in complex SSIS packages. One thought: to instrument the Data Flows only by adding Row Count Transformations just after Source Adapters and on each Data Flow Path that outputs rows. Then add an Execute SQL Task to each Data Flow Task's OnPreExecute event handler to log the start of execution, and add another Execute SQL Task to the corresponding OnPostExecute event handler. In the onPostExecute logic, store the row counts and the end time of the data flow task execution. I believe that will provide enough metrics to calculate throughput for the data flow pipeline.
Hope this helps,
Andy
Not sure if it will help, but maybe you can try to configure logging on your package and select "SSIS log provider for SQL Server Profiler"
It shows several information between begin and end of the DataSource Processing

Start stored procedures sequentially or in parallel

We have a stored procedure that runs nightly that in turn kicks off a number of other procedures. Some of those procedures could logically be run in parallel with some of the others.
How can I indicate to SQL Server whether a procedure should be run in parallel or serial — ie: kicked off of asynchronously or blocking?
What would be the implications of running them in parallel, keeping in mind that I've already determined that the processes won't be competing for table access or locks- just total disk io and memory. For the most part they don't even use the same tables.
Does it matter if some of those procedures are the same procedure, just with different parameters?
If I start a pair or procedures asynchronously, is there a good system in SQL Server to then wait for both of them to finish, or do I need to have each of them set a flag somewhere and check and poll the flag periodically using WAITFOR DELAY?
At the moment we're still on SQL Server 2000.
As a side note, this matters because the main procedure is kicked off in response to the completion of a data dump into the server from a mainframe system. The mainframe dump takes all but about 2 hours each night, and we have no control over it. As a result, we're constantly trying to find ways to reduce processing times.
I had to research this recently, so found this old question that was begging for a more complete answer. Just to be totally explicit: TSQL does not (by itself) have the ability to launch other TSQL operations asynchronously.
That doesn't mean you don't still have a lot of options (some of them mentioned in other answers):
Custom application: Write a simple custom app in the language of your choice, using asynchronous methods. Call a SQL stored proc on each application thread.
SQL Agent jobs: Create multiple SQL jobs, and start them asynchronously from your proc using sp_start_job. You can check to see if they have finished yet using the undocumented function xp_sqlagent_enum_jobs as described in this excellent article by Gregory A. Larsen. (Or have the jobs themselves update your own JOB_PROGRESS table as Chris suggests.) You would literally have to create separate job for each parallel process you anticipate running, even if they are running the same stored proc with different parameters.
OLE Automation: Use sp_oacreate and sp_oamethod to launch a new process calling the other stored proc as described in this article, also by Gregory A. Larsen.
DTS Package: Create a DTS or SSIS package with a simple branching task flow. DTS will launch tasks in individual spids.
Service Broker: If you are on SQL2005+, look into using Service Broker
CLR Parallel Execution: Use the CLR commands Parallel_AddSql and Parallel_Execute as described in this article by Alan Kaplan (SQL2005+ only).
Scheduled Windows Tasks: Listed for completeness, but I'm not a fan of this option.
I don't have much experience with Service Broker or CLR, so I can't comment on those options. If it were me, I'd probably use multiple Jobs in simpler scenarios, and a DTS/SSIS package in more complex scenarios.
One final comment: SQL already attempts to parallelize individual operations whenever it can*. This means that running 2 tasks at the same time instead of after each other is no guarantee that it will finish sooner. Test carefully to see whether it actually improves anything or not.
We had a developer that created a DTS package to run 8 tasks at the same time. Unfortunately, it was only a 4-CPU server :)
*Assuming default settings. This can be modified by altering the server's Maximum Degree of Parallelism or Affinity Mask, or by using the MAXDOP query hint.
Create a couple of SQL Server agent jobs where each one runs a particular proc.
Then from within your master proc kick off the jobs.
The only way of waiting that I can think of is if you have a status table that each proc updates when it's finished.
Then yet another job could poll that table for total completion and kick off a final proc. Alternatively, you could have a trigger on this table.
The memory implications are completely up to your environment..
UPDATE:
If you have access to the task system.. then you could take the same approach. Just have windows execute multiple tasks, each responsible for one proc. Then use a trigger on the status table to kick off something when all of the tasks have completed.
UPDATE2:
Also, if you're willing to create a new app, you could house all of the logic in a single exe...
You do need to move your overnight sprocs to jobs. SQL Server job control will let you do all of the scheduling you are asking for.
You might want to look into using DTS (which can be run from the SQL Agent as a job). It will allow you pretty fine control over which stored procedures need to wait for others to finish and what can run in parallel. You can also run the DTS package as an EXE from your own scheduling software if needed.
NOTE: You will need to create multiple copies of your connection objects to allow calls to run in parallel. Two calls using the same connection object will still block each other even if you don't explicitly put in a dependency.