SQL Queries - How Slow is Too Slow? - sql

Do you have any formal or informal standards for reasonably achievable SQL query speed? How do you enforce them? Assume a production OLTP database under full realistic production load of a couple dozen queries per second, properly equipped and configured.
Personal example for illustrative purposes (not a recommendation, highly contingent on many factors, some outside your control):
Expectation:
Each transactional unit (single statement, multiple SQL statements from beginning to end transaction boundaries, or a single stored procedure, whichever is largest) must execute in 1 second or less on average, without anomalous outliers.
Resolution:
Slower queries must be optimized to standard. Slow queries for reports and other analysis are moved to an OLAP cube (best case) or a static snapshot database.
(Obviously some execution queries (Insert/Update/Delete) can't be moved, so must be optimized, but so far in my experience it's been achievable.)

Given that you can't expect deterministic performance on a system that could (at least in theory) be subject to transient load spikes, you want your performance SLA to be probabilistic. An example of this might be:
95% of transactions to complete within 2 seconds.
95% of search queries (more appropriate for a search screen) to complete within 10 seconds.
95% of operational reports to complete within 10 seconds.
Transactional and search queries can't be moved off transactional system, so the only actions you can take are database or application tuning, or buying faster hardware.
For operational reports, you need to be ruthless about what qualifies as an operational report. Only reports that absolutely need to have access to up-to-date data should be run off the live system. Reports that do a lot of I/O are very anti-social on a production system, and normalised schemas tend to be quite inefficient for reporting. Move any reports that do not require real-time data off onto a data warehouse or some other separate reporting facility.

I usually go by the one second rule when writing/refactoring stored procedures, although my workplace doesn't have any specific rules about this. It's just my common sense. Experience tells me that if it takes up to ten seconds or more for a procedure to execute, which doesn't perform any large bulk inserts, there are usually serious problems in the code that can easily be corrected.
They way most common problem I encounter in SP:s with poor performance is incorrect use of indexes, causing costly index seek operations.

O of N is good and anything worse like N^2 will eventually be too slow.

Related

Oracle- Does executing simultaneous queries affect their performances?

I am using very large tables containing hundreds of millions of rows, and I am measuring the performances of some queries using SQL Developer, I figured out that there is an option called Unshared SQL worksheet, it allows me to execute many queries at the same time. Executing many queries at the same time is suitable for me especially that some queries or procedure take hours to be executed.
My question is does executing many queries at the same time affect performances? (by performances I mean the duration of execution of queries)
Every query costs something to execute. That's just physics. Your database has a fixed amount of resources - CPU, memory, I/O, temp disk - to service queries (let's leave elastic cloud instances out of the picture). Every query which is executing simultaneously is asking for resources from that fixed pot. Potentially, if you run too many queries at the same time you will run into resource contention, which will affect the performance of individual queries.
Note the word "potentially". Whether you will run into actual problem depends on many things: what resources your queries need, how efficiently your queries have been written, how much resource your database server has available, how efficiently it's been configured to support multiple users (and whether the DBA has implemented profiles to manage resource usage). So, like with almost every database tuning question, the answer is "it depends".
This is true even for queries which hit massive tables such as you describe. Although, if you have queries which you know will take hours to run you might wish to consider tuning them as a matter of priority.

Is it possible to reject excessively large queries on specific views?

I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the resource governor. You can throttle how much CPU and/or memory is used up be a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
Another thing to consider is to set your MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. That is, if MAXDOP is 1, then any query can only take 2 CPU threads to process. This is useful to prevent a large query from letting smaller quick ones processing. See here.
Kind of hacky but put a top x in every view
You cannot enforce it at the SQL side but on the app size they could use a TimeOut. But if they lack QC they probably lack the discipline for TimeOut. If you have some queries going 30 minutes they are probably setting a value longer than the default.
I'm not convinced about Blam's top X in each view. Without a corresponding ORDER BY clause the data will be returned in an indeterminate order. There may benefits to CDC's MAXDOP suggestion. Not so much for itself, but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.

Minimize SQL Server stress on queries from a read only schema

I want to make sure the stress to the server is minimal while running queries from a read only schema (a user can select data and create temp tables and variables, but can't execute SPs, write and other more advanced stuff). What db hints/other tricks could I use in this situation?
Currently I am:
Using the WITH (NOLOCK) hint for every table
Setting the DEADLOCK_PRIORITY for the whole batch to -10 (although I am not sure it's really needed, since I am using NOLOCK)
My goals is to take as little server resources as possible and allow other more important things to be processed by the server freely. The queries that I am going to send to the server are local (can't be saved as SPs) and there will be many of them coming from various users every 5 minutes. They are generally simple SELECTs and are cheap in isolation. Are there any other ways to make them even less expensive?
EDIT:
I am not the owner of the server I am connecting to, so I can only use the SQL query I am passing to the server to achieve what I want.
The two measures you have taken will have little impact. They are mostly used out of superstitiousness. They can have an impact in rare cases. Practically, READ UNCOMMITTED (which is 100% identical to NOLOCK) enables allocation order scans on B-trees. That is only important for tables that are not in-memory anyway.
If you want to minimize locking and blocking, snapshot isolation can be a simple and very effective solution.
In order to truly minimize the impact of a certain workload you need to use Resource Governor. Everything else are partial solutions/workarounds.
Consider limiting CPU usage, memory, IO and parallelism.

web application receiving millions of requests and leads to generating millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a Million requests per 30 seconds. So these requests will lead to generating 3-5 Million row inserts between 5 tables. This is pretty heavy load to handle. Currently we are using multi threading to handle this situation (which is a bit faster but unable to get a better CPU throughput). However the load will definitely increase in future and we will have to account for that too. After 6 months from now we are looking at double the load size we are currently receiving and I am currently looking at a possible new solution that is scalable and should be easy enough to accommodate any further increase to this load.
Currently with multi threading we are making the whole debugging scenario quite complicated and sometimes we are having problem with tracing issues.
FYI we are already utilizing the SQL Builk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert...however most of them will lead to some sql operation. The appliciation performs different types of transactions and these will lead to a lot of bulk sql operations. I am more concerned towards inserts and updates.
and these operations need not be real time there can be a bit lag...however processing them real time will be much helpful.
I think your problem looks more towards getting a better CPU throughput which will lead to a better performance. So I would probably look at something like an Asynchronous Processing where in a thread will never sit idle and you will probably have to maintain a queue in the form of a linked list or any other data structure that will suit your programming model.
The way this would work is your threads will try to perform a given job immediately and if there is anything that would stop them from doing it then they will push that job into the queue and these pushed items will be processed based on how it stores the items in the container/queue.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partion to the data by client or geography or some other factor?
Since you are using SQL server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book muight help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need to have your inserts/updates real time you might consider having two databases; one for reads and one for writes. Similar to having a OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past the data is usually moved over on a nightly bases when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
90 million rows should be about 90GB, thus your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute your gouping summing and averaging on the part of your data that doesn't change (or didn't change since last time).
For example if you process historical data for the last N years up to and including today, you could process it one month (or week, day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess period that includes today.
Some RDBMS give you some control over when views are updated (at select, at source change, offline), if your complicated grouping summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL with it's nice reporting and relationships and consistency and such, and use a OLAP Cube which is calculated every x amount of minutes. Basically you build a big table of denormalized data on a regular basis which allows quick lookups. The relational data is treated as the master, but the Cube allows quick precalcuated values to be retrieved from the database at any one point.
If that is only 1/20 of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's Big Table. Have a look at NoSQL
I personally think that MongoDB is an awesome inbetween of NoSQL and RDMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling and matching your data and processes to the underlying feature set matters.
Without knowing more about your data, and how it's currently modeled and what indexing choices you've made it's hard to say anything more.