How to test database performance: best practices - SQL

I need to test index performance for some tables in my database.
After I run my query, with or without indexes, I always use this code:
SELECT * FROM sys.dm_exec_query_optimizer_info;
And I receive details about my query.
My problem is:
the details returned by sys.dm_exec_query_optimizer_info are always changing, which makes them difficult to interpret.
What is the best solution?
Do you know any approaches or best practices?

You have to learn what the query optimizer is telling you. That the data changes is good; it means that things behave differently depending on whether you have indexes or not. However, there is no standardization on how optimizer information is presented - each DBMS does it differently. If you are going to interpret the data, you must understand it.
Looking at the query plan is important. Ultimately, so too is measuring the actual performance. It depends in part on why you are looking at the indexing at all. If there's a perceived performance problem that you are addressing, then clearly you need to ensure that the problem is resolved by the index or indexes you add. You also need to ensure that the cost the indexes impose on maintenance operations (insert, delete, update) is not intolerable - that you have not added too many indexes. You may also need to consider disk space usage - is it OK to commit so much disk space to so many indexes?
Without more specific information about your DBMS or the particular queries, it is hard to give more specific advice.
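For example, since sys.dm_exec_query_optimizer_info suggests you are on SQL Server, a minimal way to compare actual cost with and without an index is SET STATISTICS IO/TIME. This is only a sketch; the table and index names are hypothetical:

-- Hypothetical table and index; run in a dev environment, not production
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Baseline run: note the logical reads and CPU time in the Messages output
SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42;

-- Add a candidate index and run the same query again
CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);
SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Comparing logical reads between the two runs gives a far more stable signal than the cumulative counters in the optimizer DMV.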

Related

BigQuery - how to think about query optimisation when coming from a SQL Server background

I have a background that includes SQL Server and Informix database query optimisation (non big-data). I'm confident in how to maximise database performance on those systems. I've recently been working with BigQuery and big data (about 9+ months), and optimisation doesn't seem to work the same way. I've done some research and read some articles on optimisation, but I still need to better understand the basics of how to optimise on BigQuery.
In SQL Server/Informix, a lot of the time I would introduce a column index to speed up reads. BigQuery doesn't have indexes, so I've mainly been using clustering. When I've done benchmarking after introducing a cluster for a column that I thought should make a difference, I didn't see any significant change. I'm also not seeing a difference when I switch on query caching. This could be an unfortunate coincidence with the queries I've tried, or a mistaken perception; however, with SQL Server/SQLite/Informix I'm used to seeing immediate, significant improvement, consistently. Am I misunderstanding clustering (I know it's not exactly like an index, but I'm expecting it to work in a similar way), or could it just be that I've somehow been 'unlucky' with the optimisations?
And this is where the real point is. There's almost no such thing as being 'unlucky' with optimisation, but in a traditional RDBMS I would look at the execution plan and know exactly what I need to do to optimise, and find out exactly what's going on. With BigQuery, I can get the 'execution details', but it really isn't telling me much (at least that I can understand) about how to optimise, or how the query really breaks down.
Do I need a significantly different way of thinking about BigQuery? Or does it work in similar ways to an RDBMS, where I can consciously make the first JOINs eliminate as many records as possible, use WHERE clauses that focus on indexed columns, etc.?
I feel I haven't got the control to optimise like in an RDBMS, but I'm sure I'm missing a major point (or a few points!). What are the major strategies I should be looking at for BigQuery optimisation, and how can I understand exactly what's going on with queries? If anyone has any links to good documentation that would be fantastic - I'm yet to read something that makes me think "Aha, now I get it!".
It is absolutely a paradigm shift in how you think. You're right: you have hardly any control over execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)
I've personally found that premature optimization is one of the biggest issues in BigQuery - often the things you do trying to make a query faster actually have a negative impact, because things like table scans are well optimized and there are internals that you can inadvertently affect (like restructuring a query in a way that seems more optimal, but forces additional shuffles to disk for parallelization).
Some of the biggest areas where our team HAS seen greatly improved performance are as follows:
Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results).
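As a sketch of what that looks like in practice (the orders table and its repeated items field are hypothetical):

-- Hypothetical schema: orders(order_id, customer, items ARRAY<STRUCT<sku STRING, qty INT64>>)
SELECT o.order_id, item.sku, item.qty
FROM `project.dataset.orders` AS o
CROSS JOIN UNNEST(o.items) AS item
WHERE item.qty > 10;

No join to a separate line-items table is needed; the child rows live inside the parent record.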
Use partitioning/clustering on large datasets when possible. I know you mention this; just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clustering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning if the criteria have been satisfied (so it doesn't help as much on low-cardinality values).
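For instance, pruning an ingestion-time partitioned table to a single day (the events table is hypothetical):

-- Only the 2020-01-15 partition is scanned (and billed)
SELECT user_id, event_name
FROM `project.dataset.events`
WHERE _PARTITIONTIME = TIMESTAMP('2020-01-15');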
Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter out more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.
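A common pattern, sketched with a hypothetical profiles table: pick the latest row per key with ROW_NUMBER() instead of a self-join against a grouped subquery:

SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM `project.dataset.profiles`
)
WHERE rn = 1;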
Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use a CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like something's in the works for that in the BQ world, since there's an unused materializedView property showing in the API).
Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.
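For example (a sketch; in standard SQL the non-deterministic function is spelled CURRENT_TIMESTAMP()):

-- Not cacheable: the timestamp function makes the result non-deterministic
SELECT COUNT(*) FROM `project.dataset.events`
WHERE created_at < CURRENT_TIMESTAMP();

-- Cacheable: same intent expressed with a fixed literal
SELECT COUNT(*) FROM `project.dataset.events`
WHERE created_at < TIMESTAMP('2020-01-15 00:00:00');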
Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
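A minimal sketch of such a materialization (names hypothetical; in practice this would be the body of a scheduled query):

-- Save a flattened, query-ready native copy of an expensive source
CREATE OR REPLACE TABLE `project.dataset.orders_flat` AS
SELECT o.order_id, o.customer, item.sku, item.qty
FROM `project.dataset.orders` AS o
CROSS JOIN UNNEST(o.items) AS item;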
Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.
Good luck!

When to use hints in an Oracle query [duplicate]

This question already has an answer here:
When to use Oracle hints?
I have gone through some documentation on the net, and using hints is mostly discouraged. I still have doubts about this. Can hints really be useful in production, especially when the same query is used by hundreds of different customers?
Is a hint only useful when we know the number of records present in the tables? I am using LEADING in my query and it gives faster results when the data is very large, but the performance is not that great when fewer records are fetched.
This answer by David is very good, but I would appreciate it if someone clarified this in more detail.
Most hints are a way of communicating our intent to the optimizer. For instance, the LEADING hint you mention means "join the tables in this order". Why is this necessary? Often it's because the optimal join order is not obvious, because the query is badly written or the database statistics are inaccurate.
So one use of hints such as leading is to figure out the best execution path, then to figure out why the database doesn't choose that plan without the hint. Does gathering fresh statistics solve the problem? Does rewriting the FROM clause solve the problem? If so, we can remove the hints and deploy the naked SQL.
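A sketch of that diagnostic workflow (table names and the chosen join order are hypothetical):

-- Force the join order while investigating; not intended to stay in production
SELECT /*+ LEADING(o c) */
       o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date > DATE '2020-01-01';

If the hinted plan is clearly better, the next question is why the optimizer didn't pick it unaided - stale statistics and awkward predicates are the usual suspects.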
Sometimes we cannot resolve this conundrum and have to keep the hints in production. However, this should be a rare exception. Oracle has had lots of very clever people working on the Cost-Based Optimizer for many years, so its decisions are usually better than ours.
But there are other hints we would not blink at seeing in production. APPEND is often crucial for tuning bulk inserts. DRIVING_SITE can be vital in tuning distributed queries.
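For example (a sketch with hypothetical table names):

-- Direct-path bulk insert: writes above the high-water mark, minimal undo
INSERT /*+ APPEND */ INTO sales_history
SELECT * FROM sales_staging;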
Conversely other hints are almost always abused. Yes parallel, I'm talking about you. Blindly putting /*+ parallel (t23, 16) */ will probably not make your query run sixteen times faster, and not infrequently will result in slower retrieval than a single-threaded execution.
So, in short, there is no universally applicable advice to when we should use hints. The key things are:
understand how the database works, and especially how the cost-based optimizer works;
understand what each hint does;
test hinted queries in a proper tuning environment with Production-equivalent data.
Obviously the best place to start is the Oracle documentation. However, if you feel like spending some money, Jonathan Lewis's book on the Cost-Based Optimizer is the best investment you could make.
I couldn't just rephrase this, so I will paste it here
(it's a brief explanation of "When Not To Use Hints" that I had bookmarked):
In summary, don’t use hints when
What the hint does is poorly understood, which is of course not limited to the (ab)use of hints;
You have not looked at the root cause of bad SQL code and thus not yet tapped into the vast expertise and experience of your DBA in tuning the database;
Your statistics are out of date, and you can refresh the statistics more frequently or even fix the statistics to a representative state;
You do not intend to check the correctness of the hints in your statements on a regular basis, which means that, when statistics change, the hint may be woefully inadequate;
You have no intention of documenting the use of hints anyway.
Source link here.
I can summarize this as: The use of hints is not only a last resort, but also a lack of knowledge on the root cause of the issue. The CBO (Cost Based Optimizer) does an excellent job, if you just ensure some basics for it. Those include:
1. Fresh statistics
1.1. Index statistics
1.2. Table statistics
1.3. Histograms
2. Correct JOIN conditions and INDEX utilization
3. Correct database settings
This article here is worth reading:
Top 10 Reasons for poor Oracle performance
Presented by none other than Mr. Donald Burleson.
Cheers
In general, hints should be used only in exceptional situations. I know the following situations where they make sense:
Workaround for Oracle bugs
Example: Once, for a SELECT statement, I got the error ORA-01795: maximum number of expressions in a list is 1000, although the query did not contain an IN expression at all.
Problem was: The queried table contains more than 1000 (sub-) partitions and Oracle did a transformation of my query. Using the (undocumented) hint NO_EXPAND_TABLE solved the issue.
Data warehouse application while staging
While staging you can have significant changes in your data that the table/index statistics do not reflect, as statistics are gathered only once a week by default. If you know your data structure, then hints can be useful, as they are faster than running DBMS_STATS.GATHER_TABLE_STATS(...) manually all the time in between your operations. On the other hand, you can run DBMS_STATS.GATHER_TABLE_STATS() even for single columns, which might be the better approach.
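A sketch of the targeted statistics refresh mentioned above (the schema and table names are hypothetical):

BEGIN
  -- Refresh optimizer statistics right after a large staging load
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'STAGE',
    tabname => 'FACT_SALES',
    cascade => TRUE  -- gather index statistics as well
  );
END;
/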
Online Application Upgrade Hints
From Oracle documentation:
The CHANGE_DUPKEY_ERROR_INDEX, IGNORE_ROW_ON_DUPKEY_INDEX, and RETRY_ON_ROW_CHANGE hints are unlike other hints in that they have a semantic effect. The general philosophy explained in "Hints" does not apply for these three hints.

Is premature optimization in SQL as "evil" as it is in procedural programming languages?

I'm learning SQL at the moment and I've read that joins and subqueries can potentially be performance destroyers. I (somewhat) know the theory about algorithmic complexity in procedural programming languages and try to be mindful of that when programming, but I don't know how expensive different SQL queries can be. I'm deciding whether I should invest time in learning about SQL performance or just notice it when my queries run slow. The base question for me then is: is premature optimization for SQL as evil as it is for procedural languages?
As added information, I work in an environment where, most of the time, high performance is not an issue and the biggest tables I have to work with have some 150k rows.
Here's the Donald Knuth quote I refer to when saying "evil":
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
I would say that some general notions about performance are a must-have: they'll prevent you from writing really bad queries that can hurt your application (even if you don't have millions of rows in your tables).
It'll also help you design your database so it's more efficiency-oriented: you'll have some idea of where to put indexes, for instance.
But you shouldn't have performance as a first goal: the first thing is to have an application that works; and then, if needed, you'll optimize it (having some performance notions while developing will help you build an application that's easier to optimize, though).
Note that I would not say that "having notions about performance" is "premature optimization", as long as you don't just "optimize" but rather "write correctly"; I would call it a good practice that'll help you write better quality code ;-)
What Knuth means is: it's really, really important to know about SQL optimisation but only when you need to. As you say, "most of the time ... high performance is not an issue."
It's that 3% of times when you do need high performance that it's important to know what rules to break and why.
However, unlike procedural languages, even at 150k rows it can be important to know a little about how your query is processed. For instance, free-text searching will be very slow compared with searching for exact matches on indexed columns. It's at the final steps, e.g. sharding or full denormalisation, where most DBAs and developers draw the line.
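To illustrate the difference (a generic sketch; the table, columns, and index are hypothetical):

-- Fast even on 150k rows: equality on an indexed column
SELECT id, email FROM users WHERE email = 'alice@example.com';

-- Slow: a leading wildcard defeats a normal B-tree index, forcing a full scan
SELECT id, email FROM users WHERE notes LIKE '%refund%';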
I wouldn't say that SQL optimization has as many pitfalls as premature programming optimization. Designing your schema and queries ahead of time with performance in mind can help you avoid some really nasty redesigns later on. That being said, spending a day getting rid of a table scan can be utterly worthless to you in the long run if that query isn't a slow query, can be cached, or is rarely called in a manner that would impact your application.
I personally profile my queries and focus on the worst, and most used queries. Careful design ahead of time cuts out most of the worst.
I would say that you should make the SQL as easily readable as possible, and only worry about performance once it hits you.
That said.
Be mindful of standard things as you develop, such as indexes, sub-selects, use of cursors where a standard query would do the job, etc.
It will not hurt to develop the original correctly, and you can optimize the problems later when it is needed.
EDIT
Also remember that maintainability of your SQL code is very important, and that debugging SQL is slightly more difficult than normal coding.
Knuth says "forget about 97%" but for a typical web app it's in the database IO where 97% of the request time is spent. This is where a little optimization effort can yield greatest results.
If this is the kind of app you're writing, I strongly suggest learning as much about how RDBMSes work as you can afford to. Other people have given you excellent suggestions, and I'd add that I usually follow this list top-down when deciding how to spend my "optimization budget":
1. Schema design. Think twelve times about normalization and access strategies. This will save you many painful hours later.
2. Query readability. Related to #1, sometimes trying to reorganize your queries gives a better understanding of how the schema should look. It'll also help later when you ask for help.
3. Avoid subqueries in the SELECT list; use JOINs (see the sketch after this list).
4. If there are slow queries, reach for the profiler. Check for missing indexes first. And finally, if there are still slow queries, try to rewrite them.
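For point 3, a sketch of that rewrite (customers/orders are hypothetical tables):

-- Correlated subquery in the SELECT list: re-evaluated for every customer row
SELECT c.name,
       (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM customers c;

-- Equivalent JOIN version: one pass, and easier for the optimizer to plan
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;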
Keep in mind also, that database performance very much depends on data distribution and number of simultaneous requests (because of locking). Even though a query completes in 1 sec. on your underpowered netbook it could take 15 seconds on the 8-core server. If possible, check your queries on actual data. If you know that concurrency is going to be high it's (paradoxically) better to use many small queries than one big one.
I agree with everything that's said here, and I'd like to add: make sure that your SQL is well-encapsulated so that, when you discover what needs to be optimized, there's only one place you need to change it, and the change will be transparent to whatever code calls it.
Personally, I like to encapsulate all of my SQL in PL/SQL procedures, but there are some who disagree with that. Whatever you do, I recommend trying to avoid putting your SQL "inline" with other sourcecode. That seems to always lead to cut-and-pasting and quickly becomes hard to maintain. Put your SQL elsewhere, and try to re-use it as much as possible.
Also, read up on indexes, how they really work, and when you should and shouldn't use them. A lot of people's first instinct when they get a slow query is to index the table to death. That might solve the problem in the short term, but long-term an over-indexed table will be slow to insert into and update. A few well-chosen indexes are much better than indexing every field. Try reading "Refactoring SQL Applications" by Stephane Faroult.
Finally, as said above, a properly normalized database design will help you avoid 99% of your slow queries. Denormalization is necessary sometimes, but it's important that you know the rules before you break them.
Good luck!

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
1. Always query for exactly the rows I need with a WHERE clause that's different all the time.
2. Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly.
3. Always query the whole table (without a WHERE clause), let the SQL server handle the cache (it's always the same query, so it can cache the result), and filter the output as needed.
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, option 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, option 1 would be the best bet.
If you were pulling the same set of data over and over each time, then caching the results might be a better bet too, but when you are going to have a different WHERE all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in the initial situation.
When you encounter performance problems, you can look at how you could optimize them using caching. (Premature optimization is the root of all evil, as Knuth said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
First of all, let us dismiss #2. Searching tables is a database server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is being done. If it's done in the application code, then you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and do not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL effectively has to do two pieces of work: one to find out the fields in the table, and another to bring back the data you asked for.
Doing
SELECT id, email, password FROM users;
MySQL only reads the data, since the fields are explicit.
About limits: it is always best to query exactly the quantity of rows you will need, no more, no less. More data means more time to move it.

DB Design: Does having 2 Tables (One is optimized for Read, one for Write) improve performance?

I am thinking about a DB Design Problem.
For example, I am designing this stackoverflow website where I have a list of Questions.
Each Question contains certain meta data that will probably not change.
Each Question also contains certain data that will be consistently changing (Recently Viewed Date, Total Views...etc)
Would it be better to have a main table for reading the constant metadata, keep the changing values in a different table, and do a join between them?
Or would it be better to keep everything in one table?
I am not sure if this is the case, but when updating, does the row lock?
When designing a database structure, it's best to normalize first and change for performance after you've profiled and benchmarked your queries. Normalization aims to prevent data-duplication, increase integrity and define the correct relationships between your data.
Bear in mind that performing the join comes at a cost as well, so it's hard to say if your idea would help any. Proper indexing with a normalized structure would be much more helpful.
And regarding row-level locks, that depends on the storage engine - some use row-level locking and some use table-locks.
Your initial database design should be based on conceptual and relational considerations only, completely independent of physical considerations. Database software is designed and intended to support good relational design. You will hardly ever need to relax those considerations to deal with performance. Don't even think about the costs of joins, locking, and activity type at first. Then, further along, put off these considerations until all other avenues have been explored.
Your rdbms is your friend, not your adversary.
You should have the two tables separated out, as you might want to record the history of the question. The main Question table is indexed by question ID; the Status table is then indexed by question ID and a date/time stamp, and contains a row for each time the status changes.
Don't know that the updates are really significant unless you were using pessimistic locking where the row would be locked for a period of time.
I would look at caching your results either locally with Asp.net caching or using MemCached.
This would certainly be a bad idea if you were using Oracle. In Oracle, you can quite happily read records while other sessions are modifying them, due to its multi-version concurrency control. You would incur an extra performance penalty for the join, for no savings.
A design pattern that is useful, however, is to pre-join tables, pre-calculate aggregates, or pre-apply WHERE clauses using materialized views.
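In Oracle, a pre-joined materialized view might look like this (a sketch; the tables and refresh policy are hypothetical):

-- Pre-join the static metadata with the frequently-updated stats
CREATE MATERIALIZED VIEW question_summary
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT q.question_id, q.title, s.total_views, s.last_viewed
FROM   questions q
JOIN   question_stats s ON s.question_id = q.question_id;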
As already said, better to start with a clean, normalized design. It's just easier to denormalize later than to go the other way around. Experience teaches that you will never denormalize that one big table! You will just throw more columns in as needed, and you will need more and more indexes, and updates will go slower and slower.
You should also take a look at the expected loads: will there be more new answers, or just more querying? What other operations will you have? When it comes to optimization, you can use the features of your DBMS: indexing, views, ...
Eran Galperin already provided most of my answer. In addition, the structure you propose really wouldn't help you in terms of locking. If there are relatively static and dynamic attributes in the same row, breaking the static and dynamic into two tables isn't of much benefit. It doesn't matter if static data is being locked, since no one is trying to change it anyway.
In fact, you may actually do worse with this design. Some database engines use page locking. If a table has fewer/smaller columns, more rows will fit on a page. The more rows there are on a page, the more likely there will be a lock contention. By having the static data mixed in with the dynamic, the rows are bigger, therefore there are fewer rows in a page, and therefore fewer waits on page locks.
If you have two independent sets of dynamic attributes, and they are normally modified by different actors, then you might get some benefit by breaking them into different tables. This is a pretty unusual case, however.
I'd also point out that breaking the table into a static and dynamic portion may not be of benefit in a relatively small environment, but in a large distributed environment it may be useful to cache and replicate the dynamic data at different rates than the static data.