Optimizing MSSQL for High Volume Single Record Inserts - sql

We have a C# application that receives a file each day with ~35,000,000 rows. It opens the file, parses each record individually, formats some of the fields and then inserts one record at a time into a table. It's really slow, which is expected, but I've been asked to optimize it.
I have instructed that that any optimizations must be contained to SQL only. i.e., there can be no changes to the process or the C# code. I'm trying tom come up with ideas on how I can speed up this process while being limited to SQL modifications only. I have a couple of ideas I want to try but I'd also like feedback from anyone who has found themselves in this situation before.
Ideas:
1. Create a clustered index on the table so the insert always occurs at the tale end of the file. The records in the file are ordered by date/time and the current table has no clustered index so this seems like a valid approach.
Somehow reduce the logging overheard. This data is volatile in nature so it's not a big deal to be able to rollback. Even if the process blew up halfway through, they would just restart it.
Change the isolation level. Perhaps there is an isolation level that is more suited for sequential single-record inserts.
Reduce connection time. The C# app is opening/closing a connection for each insert. We can't change the C# code though so perhaps there is a trick to reducing overhead/time to make a connection.
I appreciate anyone taking the time to read my post and throw out any ideas they feel would be worth it.
Thanks,
Dean

I would suggest the following -- if possible.
Load the data into a staging table.
Do the transformations in SQL.
Bulk insert the data into the final table.
The second suggestion would be:
Modify the C# code to write the data into a file.
Bulk insert the file, either into a staging table or the final table.
Unfortunately, your problem is 35 million round trips from C# to the database. I doubt there is any database optimization that can fix that performance problem. In other words, you need to change the C# code to fix the performance issue. Anything else is probably just a waste of your time.
You can minimize logging either by using simple recovery or writing to a temporary table. Either of those might help. However, consider the second option, because it would be a minor change to the C# code and could result in big improvements.
Or, if you have to do the best in a really bad situation:
Run the C# code and database on the same server. Be sure it has lots of processors.
Attach lots of SSD or memory for the database (if you are not already using it).
Load the data into table spaces that are only on SSD or in memory.
Copy the data from the local database to the remote one.

Related

Synchronizing tables - does order of UPDATE INSERT DELETE matter?

I need to synchronize tables between 2 databases daily, the source is MSSQL 2008, the target is MSSQL 2005. If I use UPDATE, INSERT, and DELETE statements (i.e. UPDATE rows that changed, INSERT new rows, DELETE rows no longer present), will there be performance improvements if I perform the DELETE statement first? i.e. so that the UPDATE statement doesn't look at rows that don't need to be updated, because they will be deleted.
Here are some other things I need to consider. The tables have 1-3 million+ rows, and because of the amount of transactions and business requirements, the source DB needs to remain online, and the query needs to be as efficient as possible. The job will be run daily in a SQL server agent job on the target DB. On top of that, I am a DB rookie!
Thanks StackOverflow community, you are awesome!
I'd say, first you do delete, then update then insert, so you don't have to update rows which will be deleted anyway and you'll not update rows which are just inserted.
But actually, have you seen SQL Server merge syntax? It could save you a great amount of code.
update I have not checked performance of MERGE statement against INSERT/UPDATE/DELETE, here's related link given by Aaron Bertrand for more details.
Rule of Thumb: DELETE, then UPDATE, then INSERT.
Performance aside, my main concern is to avoid any potential Deadlocks when:
Updating something you will immediately Delete.
Inserting something you may immediately try to Update.
If you only modify what is necessary and use transactions correctly, then you may use any order.
P.S. Someone suggested using MERGE - I've tried it a few times and my preference is to never use it.
I think Roman's answer is what you were looking for in your current situation: DELETE, UPDATE, INSERT (or MERGE.)
Now there are other possible routes which can make things even faster, but with a rather different process:
1. Consider saving all orders in a file that you, once in a while, run against the target
Assuming both databases are exactly the same, for each SQL order that modifies the 2008 database, save that order in a .sql file which you later execute against the 2005 database. You have to consider locking the file while writing to it, and maybe have some kind of redundancy. However, this means you need no access to the 2008 database at all while doing the work on the 2005 database. In other words, no side effects to the 2008 database speed.
Pitfall: you may miss a statement and the destination will not be an exact equivalent...
2. Ongoing replication
I do not know about MSSQL enough to tell you of a good tool to do automatic replication (see here: http://technet.microsoft.com/en-us/library/ms151198.aspx), but I'd bet you can find a good tool. MySQL (http://dev.mysql.com/doc/refman/5.0/en/replication.html) and PostgreSQL (http://wiki.postgresql.org/wiki/Streaming_Replication) have such tools and those are all free.
This would be the solution I would choose. Depending on the tool you use, it can be really very well optimized meaning that the impact on the live system will be minimal and the 2005 duplicate will be up to date within seconds (depending on whether it is a long distance remote connection or not, the amount of work, the setup of each server, the Internet connections, etc.)
The pitfall is obviously that it adds an ongoing process on the database, but if you find an MSSQL tool that works like the streaming replication of PostgreSQL, it makes use of a copy of the journal which means it is dead fast (no heavy use of disk I/O.)
3. Cluster Database (like Cassandra)
This would involve a change of database which I'm totally sure you're not ready to do (especially because most of those systems do not offer SQL,) but I thought that it would be a good thing to talk about in your situation.
A system like Cassandra (http://cassandra.apache.org/) automatically replicate its data on many computers. It can actually be setup to replicate all the data 100% or X% of data per computer with redundancy in case of failure (a computer that breaks down). This alleviates the need for a specific copy on a separate computer because the performance can be increased simply by adding a few nodes to your system. (At less than $1,000 a computer, it is worth it! Frankly you could create a Peta Byte system for $50k or less and end up with something a lot faster than any SQL database...)
The main problem is that the use of those clusters is completely different than SQL. But that could be a solution for big businesses having large databases that need to be really fast and they do not want to invest in a mini-computer (think Cobol and $250k computers that manage 100 million rows in a few milli-seconds...)
With Cassandra you can run extremely heavy batch processes on back end computers that do not make a dent to the front end system!

Improve Log Exceptions

I am planning to use log4net in a new web project. In my experience, I see how big the log table can get, also I notice that errors or exceptions are repeated. For instance, I just query a log table that have more than 132.000 records, and I using distinct and found that only 2.500 records are unique (~2%), the others (~98%) are just duplicates. so, I came up with this idea to improve logging.
Having a couple of new columns: counter and updated_dt, that are updated every time try to insert same record.
If want to track the user that cause the exception, need to create a user_log or log_user table, to map N-N relationship.
Create this model may made the system slow and inefficient trying to compare all these long text... Here the trick, we should also has a hash column of binary of 16 or 32, that hash the message and the exception, and configure an index on it. We can use HASHBYTES to help us.
I am not an expert in DB, but I think that will made the faster way to locate a similar record. And because hashing doesn't guarantee uniqueness, will help to locale those similar record much faster and later compare by message or exception directly to make sure that are unique.
This is a theoretical/practical solution, but will it work or bring more complexity? what aspects I am leaving out or what other considerations need to have? the trigger will do the job of insert or update, but is the trigger the best way to do it?
I wouldn't be too concerned with a log table of 132,000 records to be honest, I have seen millions, if not billions of records in a log table. If you are logging out 132,000 records every few minutes then you might want to tone it down a bit.
I think the idea is interesting but here is my major concerns:
You could actually hurt the performance of your application by doing this. The Log4Net ADO.NET appender is synchronous. This means if you make your INSERT anymore complicated than it needs to be (aka looking up if the data already exists, calculating hash codes etc.) you will block the thread calling logging. That's not good! You could fix this writing to some sort of a staging table and doing it out of band with a job or something but now you've created a bunch of moving parts for something that could be much simpler.
Time could probably be better spent doing other things. Storage is cheap, developer hours aren't and logs don't need to be extremely fast to access so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.

Please can anyone why explain dropping and recreating stored procedures in SQL Server 2005 causes much more of an initial slow down than expected?

My first post here, please be gentle. =)
I work for a company that inherited the maintenance of a bespoke system used by one of our customers. The previous developer (no longer with us) encrypted all the database objects (WITH ENCRYPTION).
The system has been plagued with various timeout issues well before we took ownership of it, and we want to get to the bottom of these.
The database is on SQL Express 2005 in production. We want to run the profiler on it but because the various objects are encrypted, most stored procedure calls etc.. show up as '-- Encrypted Text'.
Not very useful. I've written a little console app in C# to decrypt all the database objects, which works perfectly as far as I can tell
It finds all encrypted objects in the database and for each one, decrypts it, removes the with encryption clause, drops the original and recreates it using the new 'without encryption' text.
There are some computed columns that get dropped before trying to decrypt the functions that are used in their definitions, then get recreated.
What I'm finding is that once everything is decrypted, I can't get into the system because the stored procedures etc.. take far too long to run on their first call. Execution plans are being compiled for the first time, so some delay is understandable, but we're talking 1 minute plus.. after 30 seconds the command timeout is hit, so the plans never get compiled.
I also have the same issue if I drop and recreate the database objects using their original scripts (keeping the WITH ENCRYPTION clause in).
So there's some consistency there. However, what absolutely mystifies me is that if I drop the execution plans from the original copy of the database (which was created from a backup of the production database), the same stored procedures are much faster. 10 seconds for first call. As far as I can tell, the stored procedures, functions etc.. are the same.
From my testing, I don't think it's a particular procedure or function that is causing the problem. It seems like the delay is cumulative, the more objects I drop & recreate the slower things are.
I've taken a few random stabs in the dark, rebuilding indexes and updating stats - this has had no effect at all.
We could write something to execute all 540 functions, triggers, sprocs etc.. to pre-empt the first real call from a user, however once SQL server is restarted (and our client does restart their server from time to time) the execution plans will be dropped and we'd need to run the same tool again. To me it doesn't seem a viable option (neither is increasing the CommandTimeout property), I want to know why I'm seeing this behaviour.
I've been using sys.dm_exec_query_plan and sys.dm_exec_sql_text to look at the execution plans, and using DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE as part of my testing.
I'm totally stumped, please help me before I jump out the office window.
Thanks in advance,
Andy.
--EDIT--
I don't quite know how I missed it, but the Activity Monitor is showing a session being blocked by a recompile of a table valued function. It takes far too long to compile and the blocked query hits the timeout.
I don't understand why in original version of the database (restored from backup taken from the customer site), the compilation takes around 10 seconds, but after dropping and recreating these objects in the same database, the table valued function takes almost a minute to compile.
I've tried truncating the log, which didn't have any effect. I still need to have a look at the file sizes.
-- Another edit --
The TVF returns a temporary table, and has 12 outer joins in the query, all on either sys.server_principals or sys.database_role_members.
I seem to remember reading something about recompiles and temporary tables, which I'll have to check again..
You said yourself that (computed) columns were dropped. Is it possible that other stuff was manipulated in the tables? If so, you will probably want to reindex your tables, (which will update the tables' statistics as well) using a command such as:
Exec sp_msForEachTable #COMMAND1= 'DBCC DBREINDEX ( "?")'
...though it sounds like you've done something like this. Either way, I recommend doing it once you make such a big change to all of those objects.
Third recommendation:
While you are waiting for your procs to execute, run an sp_who2 on the database to make sure nothing is blocking your queries. It's quite possible that you might have some sort of long-lived transaction happening that you haven't accounted for.
Fourth recommendation:
Make sure your server has enough memory. Make sure your transaction log files and datafiles aren't auto-growing after all of those big index and object updates. That can take FOREVER to happen, especially on under-spec'ed hardware like you may have running SQL Express.
Fifth recommendation:
Run a SQL Server Profiler trace against the database and look at what statements are starting specifically, and which are timing out. "Zoom in" on those and analyze them piece by piece and see what's up. This will likely just take a lot of good ol' hard work to fully understand.
In summary, the act of dropping and recreating procs itself shouldn't cause this slowdown if the statistics and indexes they were initially built against are sufficiently similar to what they are now. It's likely that you will find that there's Something Else happening which isn't necessarily directly related to changing the proc definitions themselves.
Another shot in the dark: Were the computed columns which you had to drop originally persisted (and not persisted after recreation) or vice versa?
If the functions called in the computation are complex or expensive, persisted columns are very advantageous and might be responsible for the behavior you are seeing.
Turns out that if I pass the parameter of the TVF to a variable, then replace where the original parameter was used, normal service is resumed (query takes less than a second, instead of a minute!)
Some kind of parameter sniffing shenanigans going on, I don't really understand why though - at the point I'm trying to call the function, no query plans exist, good or bad.
I'm in contact with Microsoft on this one (first time I've ever used my MSDN support entitlement) so hopefully we'll find out more and I'll post what I've discovered.
Thanks all for your help, we're getting there!

Refactoring "extreme" SQL queries

I have a business user who tried his hand at writing his own SQL query for a report of project statistics (e.g. number of tasks, milestones, etc.). The query starts off declaring a temp table of 80+ columns. There are then almost 70 UPDATE statements to the temp table over almost 500 lines of code that each contain their own little set of business rules. It finishes with a SELECT * from the temp table.
Due to time constraints and 'other factors', this was rushed into production and now my team is stuck with supporting it. Performance is appalling, although thanks to some tidy up it's fairly easy to read and understand (although the code smell is nasty).
What are some key areas we should be looking at to make this faster and follow good practice?
First off, if this is not causing a business problem, then leave it until it becomes a problem. Wait until it becomes a problem, then fix everything.
When you do decide to fix it, check if there is one statement causing most of your speed issues ... issolate and fix it.
If the speed issue is over all the statements, and you can combine it all into a single SELECT, this will probably save you time. I once converted a proc like this (not as many updates) to a SELECT and the time to run it went from over 3 minutes to under 3 seconds (no shit ... I couldn't believe it). By the way, don't attempt this if some of the data is coming from a linked server.
If you don't want to or can't do that for whatever reason, then you might want to adjust the existing proc. Here are some of the things I would look at:
If you are creating indexes on the temp table, wait until after your initial INSERT to populate it.
Adjust your initial INSERT to insert as many of the columns as possible. There are probably some update's you can eliminate by doing this.
Index the temp table before running your updates. Do not create indexes on any of the columns targetted by the update statements until after their updated.
Group your updates if your table(s) and groupings allow for it. 70 updates is quite a few for only 80 columns, and sounds like there may be an opportunity to do this.
Good luck
First thing I would do is check to make sure there is an active index maintenance job being run periodically. If not, get all existing indexes rebuilt or if not possible at least get statistics updated.
Second thing I would do is set up a trace (as described here) and find out which statements are causing the highest number of reads.
Then I would run in SSMS with 'show actual execution plan' and tally the results with the trace. From this you should be able to work out whether there are missing indexes that could improve performance.
EDIT: If you are going to downvote, please leave a comment as to why.
Just like any refactoring, make sure you have an automated way to verify your refactorings after each change (you can write this yourself using queries which check the development output against a known good baseline). That way, you are always matching the known good data. This will give you a high degree of confidence in the correctness of your approach when you enter the phase where you are deciding whether to switch over to your new version of the process and want to run side by side for a few iterations to ensure correctness.
I also like to log all the test batches and the run times of the processes within the batch, so I can tell if some particular process within the batch was adversely affected at some point in time. I can get average times for processes and see trends of improvement or spot potential problems. This also lets me identify the low-hanging fruit within the batch where I can make the most improvement.
There are then almost 70 UPDATE
statements to the temp table over
almost 500 lines of code that each
contain their own little set of
business rules. It finishes with a
SELECT * from the temp table.
Actually this sounds like it can be followed and understood quite well, each update statement does one thing to the table with a specific purpose and set of business rules. I think that maintaining procedures of 500 lines of code with one or a couple of select statements that does "everything", built with 15 or so joins, and case statements etc scattered all over the place, is a lot harder to maintain. Although it would make for better performance..
It's a bit of a dilemma with SQL, that writing clear and concise code (using multiple updates, creating functions etc) always seems to have a big negative impact on performance. Trying to do everything at once, which is considered bad practice in other programming languages, seems to be the very core of set oriented languages.
If this is a report generating stored procedure, how often is it being run? If it's only necessary to run it once a day and is run during the night how much of an issue is the performance?
If it's not I'd recommend being careful in your choice to re-write it because there is a chance that you could muck up your figures.
Also it sounds like the sort of thing that should be pulled out into an SSIS package building up a new permanent table with the results so it only has to be run once.
Hope this makes sense
One thing you could try is to replace the temp table with a table variable. There are times when this is faster and times when it is not, you will have to just try it and see.
Look at the 70 update statements. It is possible to combine any of them? If the person writing did not use CASE statments, it might be possible to do fewer statements.
Other obvious things to look at - eliminate any cursors, change any subqueries to joins to tables or derived tables.
Rewrite perhaps. One hardware solution would be to make sure your database temp table goes on a 'fast' drive, perhaps a solid state disk (SSD), or can be managed all in memory.
My guess is this 'solution' was developed by someone with a grasp of and a dependency upon spreadsheets, someone who may not be very savvy on 'normalized' databases--how to construct and populate tables to retain data for reporting purposes, something which perhaps BI Business Intelligence software can be utilized with sophistication and yet be adaptable.
You didn't say 'where' the update process is being run. Is the update process being run as a SQL script from a separate computer (desktop) against the server where the data is? There can be significant bottlenecks and overhead created by that approach. If so, consider running the entire update process directly on the server as a local job, as a compiled stored procedure, bypassing the network and (multiple) cursor management overhead. It could have a scheduled time to run and a controlled priority, completing in off peak business data usage hours.
Evaluate how often 'commit' statements are really needed for the sequence of update statements...saving on a bunch of commit lines could notably improve the overall update time. There may be a couple of settings in the database client driver software which can make a notable difference.
Can the queries used for update conditions be factored out as static 'views' which in turn can be shared across multiple update statements? Views can keep in memory data/query rows frequently accessed. There may be performance tuning in determining how much update data can be pended before a commit is optimal.
It might be worth evaluating whether Triggers could be used to replace the batch job update sequence. You don't say from how many tables the data used comes from...that might help with decision making. I don't know if you have the option of adding triggers to the database tables from which the data is gathered. If so, adding a few triggers to a number of tables wouldn't really degrade overall system performance much, but might save a big wad of time on that update process. You could try replacing the update statements one at a time with triggers and see if the results are the same as before. Create a similar temp table, based on the same update process, then carefully test whether triggers feeding updates to the temp table could replace individual update statements. Perhaps you may have a sort of 'Data Warehouse' application. It might be useful to review how to set up a 'star' schema of tables to retain summarized business data for reporting.
Creating a comprehensive and cached 'view' which updates via the queries once per day, reflecting the updates might be another approach to explore.
Well, since the only thing you've told us about this stored procedure is that it has a 80+ column temp table, the only thing I can recommend is to remove that table, and rewrite the rest to remove the need for it.
You should get a tool that allows you to get an explain plan of all queries your app will run. It is the best bang for the buck on an SQL heavy app for performace increases. If you read and react to what the Explain Plan is telling you. If you are on Oracle what we used to use was TOAD by Qwest(?) I think. It was a great tool.
I would recommend looking at the tables involved, the end result, and starting from scratch to see if the query can be done in a more efficient manner. Keep the query to verify that the new one is working exactly the same as the old one, but try to forget all methods used to obtain the end result.
I would rewrite it from scratch.
You say that you understand what it supposed to do so it should not be that difficult. And I bet that the requirements for that piece of code will keep changing so if you do not rewrite it now you may end up maintaining some ugly monster

Effectively transforming data from one SQL database to another in live environment

We have a bit of a messy database situation.
Our main back-office system is written in Visual Fox Pro with local data (yes, I know!)
In order to effectively work with the data in our websites, we have chosen to regularly export data to a SQL database. However the process that does this basically clears out the tables each time and does a re-insert.
This means we have two SQL databases - one that our FoxPro export process writes to, and another that our websites read from.
This question is concerned with the transform from one SQL database to the other (SqlFoxProData -> SqlWebData).
For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)
This has been working fine for many months but now we are starting to hit upon performance problems when an update is taking place (this can happen regularly during the day)
Basically when we are updating SqlWebData.ImportantTable from SqlFoxProData.ImportantTable, it's causing occasional connection timeouts/deadlocks/other problems on the live websites.
I've worked hard at optimising queries, caching etc etc, but it's come to a point where I'm looking for another strategy to update the data.
One idea that has come to mind is to have two copies of ImportantTable (A and B), some concept of which table is currently 'active', updating the non-active table, then switching the currenly actice table
i.e. websites read from ImportantTableA whilst we're updating ImportantTableB, then we switch websites to read from ImportantTableB.
Question is, is this feasible and a good idea? I have done something like it before but I'm not convinced it's necessarily good for optimisation/indexing etc.
Any suggestions welcome, I know this is a messy situation... and the long term goal would be to get our FoxPro application pointing to SQL.
(We're using SQL 2005 if it helps)
I should add that data consistency isn't particularly important in the instance, seeing as the data is always slightly out of date
There are a lot of ways to skin this cat.
I would attack the locking issues first. It is extremely rare that I would use CURSORS, and I think improving the performance and locking behavior there might resolve a lot of your issues.
I expect that I would solve it by using two separate staging tables. One for the FoxPro export in SQL and one transformed into the final format in SQL side-by-side. Then either swapping the final for production using sp_rename, or simply using 3 INSERT/UPDATE/DELETE transactions to apply all changes from the final table to production. Either way, there is going to be some locking there, but how big are we talking about?
You should be able to maintain one db for the website and just replicate to that table from the other sql db table.
This is assuming that you do not update any data from the website itself.
"For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)"
I cannot think of a case where I would ever need to perform an insert, update or delete using a cursor. If you can write the select for the cursor, you can convert it into an insert, update or delete. You can join to other tables in these statements and use the case stament for conditional processing. Taking the time to do this in a set -based fashion may solve your problem.
One thing you may consider if you have lots of data to move. We occassionally create a view to the data we want and then have two tables - one active and one that data will be loaded into. When the data is finsihed loading, as part of your process run a simple command to switch the table the view uses to the one you just finshed loading to. That way the users are only down for a couple of seconds at most. You won't create locking issues where they are trying to access data as you are loading.
You might also look at using SSIS to move the data.
Do you have the option of making the updates more atomic, rather than the stated 'clear out and re-insert'? I think Visual Fox Pro supports triggers, right? For your key tables, can you add a trigger to the update/insert/delete to capture the ID of records that change, then move (or delete) just those records?
Or how about writing all changes to an offline database, and letting SQL Server replication take care of the sync?
[sorry, this would have been a comment, if I had enough reputation!]
Based on your response to Ernie above, you asked how you replicate databases. Here is Microsoft's how-to about replication in SQL2005.
However, if you're asking about replication and how to do it, it indicates to me that you are a little light in experience for SQL server. That being said, it's fairly easy to muck things up and while I'm all for learning by experience, if this is mission critical data, you might be better off hiring a DBA or at the very least, testing the #$##$% out of this before you actually implement it.