Refactoring "extreme" SQL queries - sql

I have a business user who tried his hand at writing his own SQL query for a report of project statistics (e.g. number of tasks, milestones, etc.). The query starts off declaring a temp table of 80+ columns. There are then almost 70 UPDATE statements to the temp table over almost 500 lines of code that each contain their own little set of business rules. It finishes with a SELECT * from the temp table.
Due to time constraints and 'other factors', this was rushed into production and now my team is stuck with supporting it. Performance is appalling, although thanks to some tidy up it's fairly easy to read and understand (although the code smell is nasty).
What are some key areas we should be looking at to make this faster and follow good practice?

First off, if this is not causing a business problem, then leave it until it becomes a problem. Wait until it becomes a problem, then fix everything.
When you do decide to fix it, check if there is one statement causing most of your speed issues ... issolate and fix it.
If the speed issue is over all the statements, and you can combine it all into a single SELECT, this will probably save you time. I once converted a proc like this (not as many updates) to a SELECT and the time to run it went from over 3 minutes to under 3 seconds (no shit ... I couldn't believe it). By the way, don't attempt this if some of the data is coming from a linked server.
If you don't want to or can't do that for whatever reason, then you might want to adjust the existing proc. Here are some of the things I would look at:
If you are creating indexes on the temp table, wait until after your initial INSERT to populate it.
Adjust your initial INSERT to insert as many of the columns as possible. There are probably some update's you can eliminate by doing this.
Index the temp table before running your updates. Do not create indexes on any of the columns targetted by the update statements until after their updated.
Group your updates if your table(s) and groupings allow for it. 70 updates is quite a few for only 80 columns, and sounds like there may be an opportunity to do this.
Good luck

First thing I would do is check to make sure there is an active index maintenance job being run periodically. If not, get all existing indexes rebuilt or if not possible at least get statistics updated.
Second thing I would do is set up a trace (as described here) and find out which statements are causing the highest number of reads.
Then I would run in SSMS with 'show actual execution plan' and tally the results with the trace. From this you should be able to work out whether there are missing indexes that could improve performance.
EDIT: If you are going to downvote, please leave a comment as to why.

Just like any refactoring, make sure you have an automated way to verify your refactorings after each change (you can write this yourself using queries which check the development output against a known good baseline). That way, you are always matching the known good data. This will give you a high degree of confidence in the correctness of your approach when you enter the phase where you are deciding whether to switch over to your new version of the process and want to run side by side for a few iterations to ensure correctness.
I also like to log all the test batches and the run times of the processes within the batch, so I can tell if some particular process within the batch was adversely affected at some point in time. I can get average times for processes and see trends of improvement or spot potential problems. This also lets me identify the low-hanging fruit within the batch where I can make the most improvement.

There are then almost 70 UPDATE
statements to the temp table over
almost 500 lines of code that each
contain their own little set of
business rules. It finishes with a
SELECT * from the temp table.
Actually this sounds like it can be followed and understood quite well, each update statement does one thing to the table with a specific purpose and set of business rules. I think that maintaining procedures of 500 lines of code with one or a couple of select statements that does "everything", built with 15 or so joins, and case statements etc scattered all over the place, is a lot harder to maintain. Although it would make for better performance..
It's a bit of a dilemma with SQL, that writing clear and concise code (using multiple updates, creating functions etc) always seems to have a big negative impact on performance. Trying to do everything at once, which is considered bad practice in other programming languages, seems to be the very core of set oriented languages.

If this is a report generating stored procedure, how often is it being run? If it's only necessary to run it once a day and is run during the night how much of an issue is the performance?
If it's not I'd recommend being careful in your choice to re-write it because there is a chance that you could muck up your figures.
Also it sounds like the sort of thing that should be pulled out into an SSIS package building up a new permanent table with the results so it only has to be run once.
Hope this makes sense

One thing you could try is to replace the temp table with a table variable. There are times when this is faster and times when it is not, you will have to just try it and see.
Look at the 70 update statements. It is possible to combine any of them? If the person writing did not use CASE statments, it might be possible to do fewer statements.
Other obvious things to look at - eliminate any cursors, change any subqueries to joins to tables or derived tables.

Rewrite perhaps. One hardware solution would be to make sure your database temp table goes on a 'fast' drive, perhaps a solid state disk (SSD), or can be managed all in memory.
My guess is this 'solution' was developed by someone with a grasp of and a dependency upon spreadsheets, someone who may not be very savvy on 'normalized' databases--how to construct and populate tables to retain data for reporting purposes, something which perhaps BI Business Intelligence software can be utilized with sophistication and yet be adaptable.
You didn't say 'where' the update process is being run. Is the update process being run as a SQL script from a separate computer (desktop) against the server where the data is? There can be significant bottlenecks and overhead created by that approach. If so, consider running the entire update process directly on the server as a local job, as a compiled stored procedure, bypassing the network and (multiple) cursor management overhead. It could have a scheduled time to run and a controlled priority, completing in off peak business data usage hours.
Evaluate how often 'commit' statements are really needed for the sequence of update statements...saving on a bunch of commit lines could notably improve the overall update time. There may be a couple of settings in the database client driver software which can make a notable difference.
Can the queries used for update conditions be factored out as static 'views' which in turn can be shared across multiple update statements? Views can keep in memory data/query rows frequently accessed. There may be performance tuning in determining how much update data can be pended before a commit is optimal.
It might be worth evaluating whether Triggers could be used to replace the batch job update sequence. You don't say from how many tables the data used comes from...that might help with decision making. I don't know if you have the option of adding triggers to the database tables from which the data is gathered. If so, adding a few triggers to a number of tables wouldn't really degrade overall system performance much, but might save a big wad of time on that update process. You could try replacing the update statements one at a time with triggers and see if the results are the same as before. Create a similar temp table, based on the same update process, then carefully test whether triggers feeding updates to the temp table could replace individual update statements. Perhaps you may have a sort of 'Data Warehouse' application. It might be useful to review how to set up a 'star' schema of tables to retain summarized business data for reporting.
Creating a comprehensive and cached 'view' which updates via the queries once per day, reflecting the updates might be another approach to explore.

Well, since the only thing you've told us about this stored procedure is that it has a 80+ column temp table, the only thing I can recommend is to remove that table, and rewrite the rest to remove the need for it.

You should get a tool that allows you to get an explain plan of all queries your app will run. It is the best bang for the buck on an SQL heavy app for performace increases. If you read and react to what the Explain Plan is telling you. If you are on Oracle what we used to use was TOAD by Qwest(?) I think. It was a great tool.

I would recommend looking at the tables involved, the end result, and starting from scratch to see if the query can be done in a more efficient manner. Keep the query to verify that the new one is working exactly the same as the old one, but try to forget all methods used to obtain the end result.

I would rewrite it from scratch.
You say that you understand what it supposed to do so it should not be that difficult. And I bet that the requirements for that piece of code will keep changing so if you do not rewrite it now you may end up maintaining some ugly monster

Related

Optimizing MSSQL for High Volume Single Record Inserts

We have a C# application that receives a file each day with ~35,000,000 rows. It opens the file, parses each record individually, formats some of the fields and then inserts one record at a time into a table. It's really slow, which is expected, but I've been asked to optimize it.
I have instructed that that any optimizations must be contained to SQL only. i.e., there can be no changes to the process or the C# code. I'm trying tom come up with ideas on how I can speed up this process while being limited to SQL modifications only. I have a couple of ideas I want to try but I'd also like feedback from anyone who has found themselves in this situation before.
Ideas:
1. Create a clustered index on the table so the insert always occurs at the tale end of the file. The records in the file are ordered by date/time and the current table has no clustered index so this seems like a valid approach.
Somehow reduce the logging overheard. This data is volatile in nature so it's not a big deal to be able to rollback. Even if the process blew up halfway through, they would just restart it.
Change the isolation level. Perhaps there is an isolation level that is more suited for sequential single-record inserts.
Reduce connection time. The C# app is opening/closing a connection for each insert. We can't change the C# code though so perhaps there is a trick to reducing overhead/time to make a connection.
I appreciate anyone taking the time to read my post and throw out any ideas they feel would be worth it.
Thanks,
Dean
I would suggest the following -- if possible.
Load the data into a staging table.
Do the transformations in SQL.
Bulk insert the data into the final table.
The second suggestion would be:
Modify the C# code to write the data into a file.
Bulk insert the file, either into a staging table or the final table.
Unfortunately, your problem is 35 million round trips from C# to the database. I doubt there is any database optimization that can fix that performance problem. In other words, you need to change the C# code to fix the performance issue. Anything else is probably just a waste of your time.
You can minimize logging either by using simple recovery or writing to a temporary table. Either of those might help. However, consider the second option, because it would be a minor change to the C# code and could result in big improvements.
Or, if you have to do the best in a really bad situation:
Run the C# code and database on the same server. Be sure it has lots of processors.
Attach lots of SSD or memory for the database (if you are not already using it).
Load the data into table spaces that are only on SSD or in memory.
Copy the data from the local database to the remote one.

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports ( 150 - 200 ) columns of data per report.
Each item is a sub-query or an element from a view. The data has to be real time, so DW style batch processing is not an option. We also don't use any BI tools , just a java app that generates Excel ( its a requirement to output data in Excel)
The query also contains unions as feeds from other systems.
The queries result in very large SQL ( about 1500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient , its mostly width of the query , managing 200 columns is a challenge in itself.
I deal with queries this length daily and here is some of what helps me out in manitaining them:
First alias every single one of the those columns. When you are building it you may know where each one came from but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put columns on a separate line for each one and do the same for where clauses. At times you may have to troublshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to understand the whys not the hows in any comments. This stuff is hard to come back to later and understand even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plently of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require push back on the requirements); should you have 3 lines (one for each child record), or should you put that data in a comma delimited list or should you pick only one of the many records and have one line using aggregation. If the latter, what is the criteria for choosing the record you want to keep?
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance on the other hand you may want to consider creating temp tables or even materialized views (which are fixed views) to break up the heavier parts of your process.
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
with 10K to 100K rows, number 1 is the clear winner to me. If it was <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different where all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look on how you could optimize it using caching. (Pre optimization is the root of all evil, Dijkstra once said).
Also, remember that if you would choose option 3, you'll be sending the complete table-contents over the network as well. This also has an impact on performance .
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
First of all, let us dismiss #2. Searching tables is data servers reason for existence, and they will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say 'filter the output as needed" without saying where that filter is been done. If it's in the application code as in #2, than, as with #2, than you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
if you do this:
SELECT * FROM users;
mysql should perform two queries: one to know fields in the table and another to bring back the data you asked for.
doing
SELECT id, email, password FROM users;
mysql only reach the data since fields are explicit.
about limits: always ss best query the quantity of rows you will need, no more no less. more data means more time to drive it

How can my application benefit from temporary tables?

I've been reading a little about temporary tables in MySQL but I'm an admitted newbie when it comes to databases in general and MySQL in particular. I've looked at some examples and the MySQL documentation on how to create a temporary table, but I'm trying to determine just how temporary tables might benefit my applications and I guess secondly what sorts of issues I can run into. Granted, each situation is different, but I guess what I'm looking for is some general advice on the topic.
I did a little googling but didn't find exactly what I was looking for on the topic. If you have any experience with this, I'd love to hear about it.
Thanks,
Matt
Temporary tables are often valuable when you have a fairly complicated SELECT you want to perform and then perform a bunch of queries on that...
You can do something like:
CREATE TEMPORARY TABLE myTopCustomers
SELECT customers.*,count(*) num from customers join purchases using(customerID)
join items using(itemID) GROUP BY customers.ID HAVING num > 10;
And then do a bunch of queries against myTopCustomers without having to do the joins to purchases and items on each query. Then when your application no longer needs the database handle, no cleanup needs to be done.
Almost always you'll see temporary tables used for derived tables that were expensive to create.
First a disclaimer - my job is reporting so I wind up with far more complex queries than any normal developer would. If you're writing a simple CRUD (Create Read Update Delete) application (this would be most web applications) then you really don't want to write complex queries, and you are probably doing something wrong if you need to create temporary tables.
That said, I use temporary tables in Postgres for a number of purposes, and most will translate to MySQL. I use them to break up complex queries into a series of individually understandable pieces. I use them for consistency - by generating a complex report through a series of queries, and I can then offload some of those queries into modules I use in multiple places, I can make sure that different reports are consistent with each other. (And make sure that if I need to fix something, I only need to fix it once.) And, rarely, I deliberately use them to force a specific query plan. (Don't try this unless you really understand what you are doing!)
So I think temp tables are great. But that said, it is very important for you to understand that databases generally come in two flavors. The first is optimized for pumping out lots of small transactions, and the other is optimized for pumping out a smaller number of complex reports. The two types need to be tuned differently, and a complex report run on a transactional database runs the risk of blocking transactions (and therefore making web pages not return quickly). Therefore you generally don't want to avoid using one database for both purposes.
My guess is that you're writing a web application that needs a transactional database. In that case, you shouldn't use temp tables. And if you do need complex reports generated from your transactional data, a recommended best practice is to take regular (eg daily) backups, restore them on another machine, then run reports against that machine.
The best place to use temporary tables is when you need to pull a bunch of data from multiple tables, do some work on that data, and then combine everything to one result set.
In MS SQL, Temporary tables should also be used in place of cursors whenever possible because of the speed and resource impact associated with cursors.
If you are new to databases, there are some good books by Joe Kelko that review best practices for ANSI SQL. SQL For Smarties will describe in great detail the use of temp table, impact of indexes, where clauses, etc. It's a great reference book with in depth detail.
I've used them in the past when I needed to create evaluated data. That was before the time of views and sub selects in MySQL though and I generally use those now where I would have needed a temporary table. The only time I might use them is if the evaluated data took a long time to create.
I haven't done them in MySQL, but I've done them on other databases (Oracle, SQL Server, etc).
Among other tasks, temporary tables provide a way for you to create a queryable (and returnable, say from a sproc) dataset that's purpose-built. Let's say you have several tables of figures -- you can use a temporary table to roll those figures up to nice, clean totals (or other math), then join that temp table to others in your schema for final output. (An example of this, in one of my projects, is calculating how many scheduled calls a given sales-related employee must make per week, bi-weekly, monthly, etc.)
I also often use them as a means of "tilting" the data -- turning columns to rows, etc. They're good for advanced data processing -- but only use them when you need to. (My golden rule, as always, applies: If you don't know why you're using x, and you don't know how x works, then you probably shouldn't use it.)
Generally, I wind up using them most in sprocs, where complex data processing is needed. I'd love to give a concrete example, but mine would be in T-SQL (as opposed to MySQL's more standard SQL), and also they're all client/production code which I can't share. I'm sure someone else here on SO will pick up and provide some genuine sample code; this was just to help you get the gist of what problem domain temp tables address.

Effectively transforming data from one SQL database to another in live environment

We have a bit of a messy database situation.
Our main back-office system is written in Visual Fox Pro with local data (yes, I know!)
In order to effectively work with the data in our websites, we have chosen to regularly export data to a SQL database. However the process that does this basically clears out the tables each time and does a re-insert.
This means we have two SQL databases - one that our FoxPro export process writes to, and another that our websites read from.
This question is concerned with the transform from one SQL database to the other (SqlFoxProData -> SqlWebData).
For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)
This has been working fine for many months but now we are starting to hit upon performance problems when an update is taking place (this can happen regularly during the day)
Basically when we are updating SqlWebData.ImportantTable from SqlFoxProData.ImportantTable, it's causing occasional connection timeouts/deadlocks/other problems on the live websites.
I've worked hard at optimising queries, caching etc etc, but it's come to a point where I'm looking for another strategy to update the data.
One idea that has come to mind is to have two copies of ImportantTable (A and B), some concept of which table is currently 'active', updating the non-active table, then switching the currenly actice table
i.e. websites read from ImportantTableA whilst we're updating ImportantTableB, then we switch websites to read from ImportantTableB.
Question is, is this feasible and a good idea? I have done something like it before but I'm not convinced it's necessarily good for optimisation/indexing etc.
Any suggestions welcome, I know this is a messy situation... and the long term goal would be to get our FoxPro application pointing to SQL.
(We're using SQL 2005 if it helps)
I should add that data consistency isn't particularly important in the instance, seeing as the data is always slightly out of date
There are a lot of ways to skin this cat.
I would attack the locking issues first. It is extremely rare that I would use CURSORS, and I think improving the performance and locking behavior there might resolve a lot of your issues.
I expect that I would solve it by using two separate staging tables. One for the FoxPro export in SQL and one transformed into the final format in SQL side-by-side. Then either swapping the final for production using sp_rename, or simply using 3 INSERT/UPDATE/DELETE transactions to apply all changes from the final table to production. Either way, there is going to be some locking there, but how big are we talking about?
You should be able to maintain one db for the website and just replicate to that table from the other sql db table.
This is assuming that you do not update any data from the website itself.
"For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)"
I cannot think of a case where I would ever need to perform an insert, update or delete using a cursor. If you can write the select for the cursor, you can convert it into an insert, update or delete. You can join to other tables in these statements and use the case stament for conditional processing. Taking the time to do this in a set -based fashion may solve your problem.
One thing you may consider if you have lots of data to move. We occassionally create a view to the data we want and then have two tables - one active and one that data will be loaded into. When the data is finsihed loading, as part of your process run a simple command to switch the table the view uses to the one you just finshed loading to. That way the users are only down for a couple of seconds at most. You won't create locking issues where they are trying to access data as you are loading.
You might also look at using SSIS to move the data.
Do you have the option of making the updates more atomic, rather than the stated 'clear out and re-insert'? I think Visual Fox Pro supports triggers, right? For your key tables, can you add a trigger to the update/insert/delete to capture the ID of records that change, then move (or delete) just those records?
Or how about writing all changes to an offline database, and letting SQL Server replication take care of the sync?
[sorry, this would have been a comment, if I had enough reputation!]
Based on your response to Ernie above, you asked how you replicate databases. Here is Microsoft's how-to about replication in SQL2005.
However, if you're asking about replication and how to do it, it indicates to me that you are a little light in experience for SQL server. That being said, it's fairly easy to muck things up and while I'm all for learning by experience, if this is mission critical data, you might be better off hiring a DBA or at the very least, testing the #$##$% out of this before you actually implement it.