I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the Resource Governor. You can throttle how much CPU and/or memory is used up by a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
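For illustration, a minimal Resource Governor sketch, with made-up pool, group, and login names (run it in master; the percentages are arbitrary and just show the idea):

```sql
-- Run in master. Pool, group, and login names are hypothetical.
CREATE RESOURCE POOL ReportPool
    WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);
GO
CREATE WORKLOAD GROUP ReportGroup
    USING ReportPool;
GO
-- Classifier routes the hypothetical ReportUser login into the capped group.
CREATE FUNCTION dbo.fnClassifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp SYSNAME = N'default';
    IF SUSER_SNAME() = N'ReportUser'
        SET @grp = N'ReportGroup';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```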
Another thing to consider is to set your MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. That is, if MAXDOP is 2, then any query can use at most 2 CPU threads. This is useful to keep one large query from starving the smaller, quicker ones. See here.
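For reference, the instance-wide setting and the per-query hint look something like this (the value 2 is only an example):

```sql
-- Instance-wide cap on parallelism (2 is only an example value)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;
RECONFIGURE;

-- Or per query, without touching the server setting:
-- SELECT ... FROM dbo.SomeBigView OPTION (MAXDOP 2);
```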
Kind of hacky, but you could put a TOP X in every view.
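Something like this, with made-up names and an arbitrary row cap:

```sql
-- Hypothetical names: cap the view at 10,000 rows
ALTER VIEW dbo.vBigProcessedData
AS
SELECT TOP (10000) *
FROM dbo.BigBaseTable;
```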
You cannot enforce it on the SQL side, but on the app side they could use a timeout. But if they lack QC they probably lack the discipline for timeouts. If you have some queries going 30 minutes, they are probably setting a value longer than the default.
I'm not convinced about Blam's TOP X in each view. Without a corresponding ORDER BY clause the data will be returned in an indeterminate order. There may be benefits to CDC's MAXDOP suggestion. Not so much for itself, but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.
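A rough sketch of that idea, with made-up names and an arbitrary 31-day cap on the date range:

```sql
CREATE PROCEDURE dbo.GetProcessedData
    @StartDate DATE,
    @EndDate   DATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Reject ranges wider than whatever threshold you settle on (31 days here)
    IF DATEDIFF(DAY, @StartDate, @EndDate) > 31
    BEGIN
        RAISERROR('Date range too large; narrow it to 31 days or less.', 16, 1);
        RETURN;
    END;

    SELECT v.*
    FROM dbo.vBigProcessedData AS v          -- hypothetical view
    WHERE v.SomeDateColumn >= @StartDate
      AND v.SomeDateColumn <  DATEADD(DAY, 1, @EndDate);
END;
```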
I'm writing a RESTful API which queries an RDBMS. The first time the SQL query in question is run, it's taking around 10 seconds. Subsequent runs of the query with the same bind variables take milliseconds. I believe this is due to the results being present in the buffer cache.
I don't want to store the results of this query in the application layer due to space considerations and would rather run it every time I need it. For performance reasons, I'm relying on the fact that the results will be retrieved through a buffer cache hit. Subsequent query runs will be seconds or minutes apart and it appears that the results stay present in the buffer cache for longer than this.
Is it a reasonable design decision to leverage the buffer cache in this manner?
I would suggest wrapping the query logic at the application layer in some sort of function. Then, implement the function using a query, which can rely on the query cache.
In the future, you may want to change the implementation to locally cache the results. Encapsulating the functionality makes it easy to change the implementation, without affecting other code.
Is this a reasonable design decision?
I would suggest you'd want to do some fairly extensive load and performance testing to answer that question.
I am a pessimist, but when you have queries that take 10 seconds, it's not unusual for them to slow down over time (you're probably not hitting an index, so as the data grows, the performance gets worse). Those queries also tend to struggle with contention for resources, so performance degrades as the number of concurrent queries increases.
I'd also want to look at which business events would cause a cache refresh in Oracle, and their frequency. If you're querying orders in a web shop for instance, you may get new data every second, so your cache will not have a long time-to-live.
If you rely on the buffer cache to keep the application performant, I'd want to understand exactly what happens when the cache is missed, how many concurrent queries you can handle, and what happens when 10 seconds becomes 20 or 30 or 60.
I want to make sure the stress on the server is minimal while running queries from a read-only schema (a user can select data and create temp tables and variables, but can't execute SPs, write data, or do other more advanced things). What DB hints or other tricks could I use in this situation?
Currently I am:
Using the WITH (NOLOCK) hint for every table
Setting the DEADLOCK_PRIORITY for the whole batch to -10 (although I am not sure it's really needed, since I am using NOLOCK)
My goal is to use as few server resources as possible and allow other, more important things to be processed by the server freely. The queries that I am going to send to the server are local (they can't be saved as SPs) and there will be many of them coming from various users every 5 minutes. They are generally simple SELECTs and are cheap in isolation. Are there any other ways to make them even less expensive?
EDIT:
I am not the owner of the server I am connecting to, so I can only use the SQL query I am passing to the server to achieve what I want.
The two measures you have taken will have little impact. They are mostly used out of superstition. They can have an impact in rare cases: practically, READ UNCOMMITTED (which is 100% identical to NOLOCK) enables allocation-order scans on B-trees. That is only important for tables that are not in memory anyway.
If you want to minimize locking and blocking, snapshot isolation can be a simple and very effective solution.
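It is a one-time database setting, after which readers opt in per session; note that since you don't own the server, the ALTER DATABASE part needs the owner's cooperation (names below are made up):

```sql
-- Done once by the database owner (not possible from a read-only login):
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Then, in the read-only session, instead of sprinkling NOLOCK hints:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

SELECT c.CustomerId, c.Name      -- hypothetical table and columns
FROM dbo.Customers AS c
WHERE c.Region = 'EMEA';
```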
In order to truly minimize the impact of a certain workload you need to use Resource Governor. Everything else is a partial solution or a workaround.
Consider limiting CPU usage, memory, IO and parallelism.
If a query is constantly sent to a database at short intervals, say every 5 seconds, could the number of reads generated cause problems in terms of performance or availability? If the database is Oracle are there any tricks that can be used to avoid a performance hit? If the queries are coming from an application is there a way to reduce any impact through software design?
Unless your query is very intensive or horribly written then it won't cause any noticeable issues running once every few seconds. That's not very often for queries that are generally measured in milliseconds.
You may still want to optimize it though, simply because there are better ways to do it. In Oracle and ADO.NET you can use an OracleDependency for the command that ran the query the first time and then subscribe to its OnChange event which will get called automatically whenever the underlying data would cause the query results to change.
It depends on the query. I assume the reason you want to execute it periodically is because the data being returned will change frequently. If that's the case, then application-level caching is obviously not an option.
Past that, is this query "big" in terms of the number of rows returned, tables joined, data aggregated / calculated? If so, it could be a problem if:
You are issuing the query more often than it takes to execute. If you are calling it once a second, but it takes 2 seconds to run, that's going to become a problem.
If the query is touching a lot of data and you have a lot of other queries accessing the same tables, you could run into lock escalation issues.
As with most performance questions, the only real answer is to test. In this case test with realistic data in the DB and run this query concurrent with the other query load you expect on the system.
Along the lines of Samuel's suggestion, Oracle provides facilities in JDBC to do database change notification so that your application can subscribe to changes in the underlying data rather than re-running the query every few seconds. If the data is changing less frequently than you're running the query, this can be a major performance benefit.
Another option would be to use Oracle TimesTen as an in-memory cache of the data on the middle-tier machine(s). That will reduce the network round-trips and it will go through a very optimized retrieval path.
Finally, I'd take a look at using the query result cache to have Oracle cache the results.
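For what it's worth, the result cache can be requested per query with a hint (11g or later; the table and columns are made up):

```sql
-- Ask Oracle to cache this result set and reuse it until the
-- underlying data changes (11g+; table and columns are hypothetical)
SELECT /*+ RESULT_CACHE */ status, COUNT(*) AS cnt
FROM orders
GROUP BY status;
```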
I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly like it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
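For example, an average would be decomposed roughly like this (table and column names are made up):

```sql
-- Rewritten query sent to every node:
SELECT SUM(amount) AS sum_amount, COUNT(amount) AS cnt_amount
FROM sales;

-- On the coordinator, over the union of the per-node result sets:
SELECT SUM(sum_amount) / SUM(cnt_amount) AS avg_amount
FROM node_results;
```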
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
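A sketch of that kind of roll-up, with made-up names and T-SQL syntax just for illustration:

```sql
-- Hypothetical roll-up: one row per hour instead of one row per hit
DECLARE @lastSummarized DATETIME = '2024-01-01';   -- example watermark

INSERT INTO dbo.hits_per_hour (hit_date, hit_hour, hit_count)
SELECT CAST(hit_time AS DATE), DATEPART(HOUR, hit_time), COUNT(*)
FROM dbo.web_hits
WHERE hit_time >= @lastSummarized
GROUP BY CAST(hit_time AS DATE), DATEPART(HOUR, hit_time);
```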
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying custom solution (grid) you introduce a lot of complexity. Maybe, it's your only solution, but first did you try partitioning the table (native solution)?
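In SQL Server, for instance, native range partitioning looks roughly like this (boundary values and names are invented):

```sql
-- Partition by year (boundaries and names are examples)
CREATE PARTITION FUNCTION pfByYear (DATE)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.BigTable_Partitioned
(
    Id        BIGINT       NOT NULL,
    EventDate DATE         NOT NULL,
    Payload   VARCHAR(200) NULL
) ON psByYear (EventDate);
```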
I'd seriously be looking into an OLAP solution. The trick with the Cube is once built it can be queried in lots of ways that you may not have considered. And as #HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more?
I also agree fully with #Mason, have you profiled your query and investigated the query plan to see where your bottlenecks are. Adding nodes improving speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau, will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by re-aggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (which I have worked with successfully on 10 million+ row tables).
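A minimal indexed-view sketch of that approach (names are hypothetical, and the Amount column is assumed NOT NULL, which indexed views require for SUM):

```sql
-- Indexed view: requires SCHEMABINDING, two-part names, and COUNT_BIG(*)
CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT SaleDate,
       SUM(Amount)  AS TotalAmount,   -- Amount assumed NOT NULL
       COUNT_BIG(*) AS RowCnt
FROM dbo.Sales
GROUP BY SaleDate;
GO

-- Materialize it
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay
    ON dbo.vSalesByDay (SaleDate);
```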
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.
I have a stored proc that processes a large amount of data (about 5m rows in this example). The performance varies wildly. I've had the process running in as little as 15 minutes and seen it run for as long as 4 hours.
For maintenance, and in order to verify that the logic and processing is correct, we have the SP broken up into sections:
1. TRUNCATE and populate a work table (indexed) that we can verify later with automated testing tools.
2. Join several tables together (including some of these work tables) to produce another work table.
3. Repeat 1 and/or 2 until a final output is produced.
My concern is that this is a single SP and so gets an execution plan when it is first run (even WITH RECOMPILE). But at that time, the work tables (permanent tables in a Work schema) are empty.
I am concerned that, regardless of the indexing scheme, the execution plan will be poor.
I am considering breaking up the SP and calling separate SPs from within it so that they could take advantage of a re-evaluated execution plan after the data in the work tables is built. I have also seen reference to using EXEC to run dynamic SQL which, obviously, might get a recompile as well.
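For illustration, something along these lines is what I have in mind (table names are placeholders):

```sql
-- After the work table has been populated, a statement-level
-- OPTION (RECOMPILE) compiles against the now-filled table:
INSERT INTO Work.FinalOutput (Id, Total)
SELECT s.Id, SUM(s.Amount)
FROM Work.Stage1 AS s
GROUP BY s.Id
OPTION (RECOMPILE);

-- Or the dynamic SQL variant, which also gets its own plan:
EXEC ('INSERT INTO Work.FinalOutput (Id, Total)
       SELECT Id, SUM(Amount) FROM Work.Stage1 GROUP BY Id');
```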
I'm still trying to get SHOWPLAN permissions, so I'm flying quite blind.
Are you able to determine whether there are any locking problems? Are you running the SP in sufficiently small transactions?
Breaking it up into subprocedures should have no benefit.
Somebody should be concerned about your productivity, working without basic optimization resources. That suggests there may be other possible unseen issues as well.
Grab the free copy of "Dissecting SQL Server Execution Plans" at the link below and maybe you can pick up a tip or two from it that will give you some idea of what's really going on under the hood of your SP.
http://dbalink.wordpress.com/2008/08/08/dissecting-sql-server-execution-plans-free-ebook/
Are you sure that the variability you're seeing is caused by "bad" execution plans? This may be a cause, but there may be a number of other reasons:
"other" load on the db machine
when using different data, there may be "easy" and "hard" data
issues with having to allocate more memory/file storage
...
Have you tried running the SP with the same data a few times?
Also, in order to figure out what is causing the runtime/variability, I'd try to do some detailed measuring to pin the problem down to a specific section of the code. (Easiest way would be to insert some log calls at various points in the sp). Then try to explain why that section is slow (other than "5M rows ;-)) and figure out a way to make that faster.
For now, I think there are a few questions to answer before going down the "splitting up the sp" route.
You're right, it is quite difficult for you to get a clear picture of what is happening behind the scenes until you can get the "actual" execution plans from several executions of your overall process.
One point to consider, perhaps: are your work tables physical or temporary tables? If they are physical, you will get a performance gain by inserting new data into a new table without an index (i.e. a heap), which you can then build an index on after all the data has been inserted.
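A sketch of that pattern, with made-up table and index names:

```sql
-- Load the work table as a heap, then build the index afterwards
DROP INDEX IX_Work_Stage1 ON Work.Stage1;        -- hypothetical names
TRUNCATE TABLE Work.Stage1;

INSERT INTO Work.Stage1 (Id, Amount)
SELECT s.Id, s.Amount
FROM dbo.SourceTable AS s;

CREATE INDEX IX_Work_Stage1 ON Work.Stage1 (Id) INCLUDE (Amount);
```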
Also, what is the purpose of your process? It sounds like you are moving quite a bit of data around, in which case you may wish to consider the use of partitioning. You can switch data in and out of your main table with relative ease.
Hope what I have detailed is clear but please feel free to pose further questions.
Cheers, John
In several cases I've seen, this level of diversity in execution times / query plans comes down to statistics. I would recommend some tests running UPDATE STATISTICS against the tables you are using just before the process is run. This will both force a re-evaluation of the execution plan by SQL Server and, I suspect, give you more consistent results. Additionally, you may do well to see if the differences in execution time correlate with re-indexing jobs by your DBAs. Perhaps you could also gather some index health statistics before each run.
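For example, just before kicking off the process (table names are placeholders; FULLSCAN is just one option):

```sql
-- Refresh stats on the tables the SP touches so the optimizer
-- re-evaluates its plans against current data distributions
UPDATE STATISTICS Work.Stage1 WITH FULLSCAN;       -- hypothetical tables
UPDATE STATISTICS dbo.SourceTable WITH FULLSCAN;

-- Or, more bluntly, for the whole database:
EXEC sp_updatestats;
```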
If not, as other answerers have suggested, you are more likely suffering from locking and/or contention issues.
Good luck with it.
The only thing I can think that an execution plan would do wrong when there's no data is err on the side of using a table scan instead of an index, since table scans are super fast when the whole table will fit into memory. Are there other negatives you're actually observing or are sure are happening because there's no data when an execution plan is created?
You can force usage of indexes in your query...
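For example, with a hypothetical table and index name:

```sql
-- Force a specific (hypothetical) index; use sparingly, it overrides the optimizer
SELECT s.Id, s.Amount
FROM Work.Stage1 AS s WITH (INDEX (IX_Work_Stage1))
WHERE s.Id = 42;
```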
Seems to me like you might be going down the wrong path.
Is this an infeed or outfeed of some sort or are you creating a report? If it is a feed, I would suggest that you change the process to use SSIS which should be able to move 5 million records very fast.