How far does optimization in SQL queries go?

My background is application programming, and there is a guideline that says not to try to "outthink" the compiler (JIT and so on) when it comes to optimization.
Is this also the case with SQL queries?
I mean, I have read that SQL servers build some kind of execution plan for a query that is expected to be optimal (right?), but do they rearrange/modify the actual queries?
Or is the programmer expected to make sure the queries are optimal, e.g. first select and then join, etc.?

My experience, which includes working for a database server vendor, is as follows.
First, databases have been highly optimized and compiled to machine code (they are often written in C or C++). On modern equipment most operations are so fast that sub-optimal execution won't be noticed.
However, there are some areas to be aware of.
If you have no indexes, then the database has to do a table scan and that can be slow. Many people only put one field into an index, but you should consider multiple fields as they apply. The explain utilities are there to show you what index it found, and suggest what index would help.
Correlated queries can be slow. Also, when you have a WHERE clause with an expression wrapped around a column, the database has to evaluate that expression for each record and cannot use an index.
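For illustration, here is a rough sketch of rewriting a correlated query as a join; the customers/orders tables and their columns are hypothetical, and the two forms are only equivalent under the usual assumptions (one summary row per customer):
-- correlated form: the subquery is evaluated once per customer row
select c.customer_id, c.name,
       (select sum(o.amount)
        from orders o
        where o.customer_id = c.customer_id) as total_amount
from customers c;
-- join form: the optimizer can often run this as a single hash or merge join
select c.customer_id, c.name, sum(o.amount) as total_amount
from customers c
left join orders o on o.customer_id = c.customer_id
group by c.customer_id, c.name;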
Opening a connection is slow, so be sure to reuse the connection and not re-open it for every operation.
However, the biggest issue today is typically the network communication between the database client and the database server. Try to minimize the network turns to the database, and have the database filter results so less data needs to be sent over the network.

There are things that you want to let the Database do, and there are things that only people can do. Database Management cannot be left up to the database itself. People have to be involved.
Database Optimization is both an art and a science. The Database does a great job of optimizing queries by selecting the best index from those that are already created. However, databases don't automatically create the best indexes. It is the job of a DBA/Programmer to determine what the best indexes are.
An index may make the query extremely fast, but it may require 1 GB of memory. That is not an index you generally want to add. A person can look at the query, though, and realize that a slight reformatting of the query is all that is needed.
A developer with knowledge of the data itself is equipped to make good decisions on what indexes to use and such. It is also good to review your indexes to see if some of them are even being used. Sometimes indexes are created and never used by the database, because a different index is always better or a search is never run that needs the index.
So, databases make great decisions on how to run queries most efficiently based on the indexes that they already have, but it is our job to analyze whether or not the databases have the right indexes and take appropriate action.
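If you happen to be on SQL Server, one rough way to review whether indexes are actually being used is the index-usage DMV; a sketch (note the stats reset when the server restarts):
-- how often each index has been read vs. merely maintained since the last restart
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups,
       s.user_updates           AS maintenance_writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_seeks + s.user_scans + s.user_lookups;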

In general, the advice is good. Many more person-years of development go into the creation of the optimization engine than you are going to manage.
That said, there are definitely pitfalls with every database. In some cases, you need to express certain logic in a certain way to make it more efficient. Or, you might need to add hints to get the right execution path.
This is because optimization for SQL is generally much more difficult than optimization for other languages. It requires understanding the data and the distribution of values to arrive at the best solution.
My advice is to write the queries in a way that best expresses what you want done, to write them with naming conventions and indentation that convey the purpose of the query. That way, if you do have to modify the query, you will at least understand what it is doing.
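As an example of the kind of hint mentioned above (SQL Server syntax; the tables are made up, and hints should generally be a last resort):
-- force a hash join for this one query instead of whatever the optimizer picked
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID
OPTION (HASH JOIN);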

There are situations where your own knowledge comes in handy. Here are some examples.
1 - you want everything for this month. This is straightforward
where year(datefield) = 2013
and month(datefield) = 2    -- February
but this will run faster
where datefield >= '2013-02-01'
and datefield < '2013-03-01'
2 - you want boys named Pat. Sex is indexed, name is not. This is faster
where sex = 'M'
and name = 'Pat'
than this
where name = 'Pat'
and sex = 'M'
3 - in a case construct, list the situation that will occur most often first. This
case when something that almost always happens then 'yes' else 'no' end
will run faster than
case when something that almost never happens then 'no' else 'yes' end


is "where (ParamID = #ParamID) OR (#ParamID = -1)" a good practice in sql selection

I used to write SQL statements like
select * from teacher where (TeacherID = @TeacherID) OR (@TeacherID = -1)
and pass @TeacherID = -1 to select all teachers.
Now I'm worried about the performance.
Can you tell me whether that is a good practice or a bad one?
Many thanks.
If TeacherID is indexed and you are passing a value other than -1 as TeacherID to search for details of a specific teacher then this query will end up doing a full table scan rather than the potentially far more efficient option of seeking into the index to retrieve the details of the specific teacher...
... Unless you are on SQL 2008 SP1 CU5 and later and use the OPTION (RECOMPILE) hint. See Dynamic Search Conditions in T-SQL for the definitive article on the topic.
We use this in a very limited fashion in stored procedures.
The problem is that the database engine isn't able to keep a good query plan for it. When dealing with a lot of data this can have a serious negative performance impact.
However, for smaller data sets (I'd say less than 1000 records, but that's a guess) it should be fine. You'll have to test in your particular environment.
If it's in a stored procedure, you might want to include something like a WITH RECOMPILE option so that the plan is regenerated on each execution. This adds (slightly) to the time for each run, but over several runs can actually reduce the average execution time. Also, this allows the database to inspect the actual query and "short circuit" the parts that aren't necessary on each call.
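A minimal sketch of the statement-level variant (T-SQL; the procedure, table and column names are made up), using the OPTION (RECOMPILE) hint mentioned above so the plan is rebuilt for the actual parameter value on each call:
CREATE PROCEDURE dbo.GetTeachers
    @TeacherID int
AS
BEGIN
    -- when @TeacherID = -1, the recompile lets the optimizer drop the TeacherID predicate for that execution
    SELECT TeacherID, FirstName, LastName
    FROM dbo.Teacher
    WHERE (TeacherID = @TeacherID OR @TeacherID = -1)
    OPTION (RECOMPILE);
END;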
If you are directly creating your SQL and passing it through, then I'd suggest you make the part that builds your sql a little smarter so that it only includes the part of the where clause you actually need.
Another path you might consider is using UNION ALL queries as opposed to optional parameters. For example:
SELECT * FROM Teacher WHERE (TeacherID = @TeacherID)
UNION ALL
SELECT * FROM Teacher WHERE (@TeacherID = -1)
This actually accomplishes the exact same thing; however, the query plan is cacheable. We've used this method in a few places as well and saw performance improvements over using WITH RECOMPILE. We don't do this everywhere because some of our queries are extremely complicated and I'd rather have a performance hit than to complicate them further.
Ultimately though, you need to do a lot of testing.
There is a second part here that you should reconsider. SELECT *. It is ALWAYS preferable to actually name the columns you want returned and to make sure that you are only returning the ones you will actually need. Moving data across network boundaries is very expensive and you can generally get a fair amount of performance boost simply by specifying exactly what you want. In addition if what you need is very limited you can sometimes do covering indexes so that the database engine doesn't even have to touch the underlying tables to get the data you want.
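For example, a covering index for a narrow teacher lookup might look like this (SQL Server syntax; the index name and included columns are only illustrative):
-- the key column supports the seek; the INCLUDE columns let the index cover the select list
CREATE NONCLUSTERED INDEX IX_Teacher_TeacherID_Covering
    ON dbo.Teacher (TeacherID)
    INCLUDE (FirstName, LastName);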
If you're really worried about performance, you could break up your procedure to call on two different procs: one for all records, and one based on the parameter.
IF @TeacherID = -1
    EXEC proc_Get_All_Teachers
ELSE
    EXEC proc_Get_Teacher_By_TeacherID @TeacherID
Each one can be optimized individually.
It's your system; compare the performance. Consider optimizing for the most popular choice. If most users are going to select a single record, why hinder their performance just to accommodate the few who select all teachers (and who should still have a reasonable expectation of performance)?
I know a single select query is easier to maintain, but at some point ease of maintenance eventually gives way to performance.

How to tell if a query will scale well?

What are some of the methods/techniques experienced SQL developers use to determine if a particular SQL query will scale well as load increases, rows in associated tables increase etc.
Some rules that I follow that make the most difference.
Don't use per-row functions in your queries like if, case, coalesce and so on. Work around them by putting data in the database in the format you're going to need it, even if that involves duplicate data.
For example, if you need to look up surnames fast, store them in the entered form and in their lowercase form, and index the lowercase form. Then you don't have to worry about things like select * from tbl where lower(surname) = 'smith';
Yes, I know that breaks 3NF but you can still guarantee data integrity by judicious use of triggers or pre-computed columns. For example, an insert/update trigger on the table can force the lower_surname column to be set to the lowercase version of surname.
This moves the cost of conversion to the insert/update (which happens infrequently) and away from the select (which happens quite a lot more). You basically amortise the cost of conversion.
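A sketch of the same idea using a computed column instead of a trigger (SQL Server syntax; tbl, surname and lower_surname are just the names from the example above):
-- the persisted computed column is maintained automatically on insert/update
ALTER TABLE tbl ADD lower_surname AS LOWER(surname) PERSISTED;
CREATE INDEX ix_tbl_lower_surname ON tbl (lower_surname);
-- the query can now hit the index directly
SELECT * FROM tbl WHERE lower_surname = 'smith';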
Make sure that every column used in a WHERE clause is indexed. Not necessarily on its own, but at least as the leading column of a composite index.
Always start off in 3NF and only revert if you have performance problems (in production). 3NF is often the easiest to handle and reverting should only be done when absolutely necessary.
Profile, in production (or elsewhere, as long as you have production data and schemas). Database tuning is not a set-and-forget operation unless the data in your tables never changes (very rare). You should be monitoring, and possibly tuning, periodically to avoid the possibility that changing data will bring down performance.
Don't, unless absolutely necessary, allow naked queries to your database. Try to control what queries can be run. Your job as a DBA will be much harder if some manager can come along and just run:
select * from very_big_table order by column_without_index;
on your database.
If managers want to be able to run ad-hoc queries, give them a cloned DBMS (or replica) so that your real users (the ones that need performance) aren't affected.
Don't use union when union all will suffice. If you know that there can be no duplicates between two selects of a union, there's no point letting the DBMS try to remove them.
Similarly, don't use select distinct on a table if you're retrieving all the primary key columns (or all columns in a unique constraint). There is no possibility of duplicates in those cases so, again, you're asking the DBMS to do unnecessary work.
Example: we had a customer with a view using select distinct * on one of their tables. Querying the view took 50 seconds. When we replaced it with a view starting select *, the time came down to sub-second. Needless to say, I got a good bottle of red wine out of that :-)
Try to avoid select * as much as possible. In other words, only get the columns you need. This makes little difference when you're using MySQL on your local PC but, when you have an app in California querying a database in Inner Mongolia, you want to minimise the amount of traffic being sent across the wire as much as possible.
Don't make tables wide; keep them narrow, and keep the indexes narrow as well. Make sure that queries are fully covered by indexes and that those queries are SARGable.
Test with a ton of data before going into production: your testbed has to have the same volume of data as production in order to simulate normal usage.
Pull up the execution plan and look for any of the following:
Table Scan
[Clustered] Index Scan
RID Lookup
Bookmark Lookup
Key Lookup
Nested Loops
Any of those things (in descending order from most to least scalable) mean that the database/query likely won't scale to much larger tables. An ideal query will have mostly index seeks, hash or merge joins, the occasional sort, and other low-impact operations (spools and so on).
The only way to prove that it will scale, as other answers have pointed out, is to test it on data of the desired size. The above is just a rule of thumb.
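For instance, in SQL Server you can ask for the plan from a query window without executing anything; a minimal sketch (the Orders table and CustomerID column are made up):
SET SHOWPLAN_XML ON;
GO
-- the statement below is compiled but not executed; the plan comes back as a result set
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;
GO
SET SHOWPLAN_XML OFF;
GO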
In addition to (and along the same lines as) Robert's suggestion, consider the execution plan. Is it utilizing indexes? Are there any scans or such? Can you simplify the query in any way? For example, eliminate IN in favor of EXISTS, and only join to tables you actually need to join to.
You don't mention the technology -- keep in mind that different technologies can affect the efficiency of more complex queries.
I strongly recommend reading some reference material on this. The book below is probably a pretty good one to look into. Make sure to look under "Selectivity", among other topics.
SQL Tuning - Dan Tow

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space; joining tables is one of those things, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need, a single SQL statement will give better performance 99% of the time. I'm not sure whether the connections are being created dynamically in this case, but if so, doing that is expensive. Even if the process is reusing existing connections, the DBMS doesn't get to optimize the queries in the best way and isn't really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so many joins in it that it would take SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables), broke the query down into a lot of smaller single-select statements, and constructed the final result set in this manner.
This update dramatically improved the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
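A rough sketch of that "break it up" approach (T-SQL; the Orders/Customers tables and the date filter are purely illustrative):
-- stage a small intermediate result first...
DECLARE @recent_orders TABLE (order_id int, customer_id int, amount money);
INSERT INTO @recent_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM dbo.Orders
WHERE order_date >= '20130101';
-- ...then build the final result set from the much smaller staged set
SELECT c.customer_name, r.order_id, r.amount
FROM @recent_orders AS r
JOIN dbo.Customers AS c ON c.customer_id = r.customer_id;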
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query will give better performance, as the SQL server (which sometimes isn't even in the same location) only needs to handle one request. If you use multiple SQL queries, you introduce a lot of overhead: executing more CPU instructions, sending a second query to the server, creating a second thread on the server, possibly executing more CPU instructions on the server, destroying a second thread on the server, and sending the second result set back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records; that is the biggest performance hit a database takes (in my limited experience).
You should always try to minimize the number of queries to the database when you can. Your example is perfect for only one query. This way you will later be able to cache more easily, or handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only one each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
A join scans both tables and loops to match records from the first table in the second table. A simple select query will work faster in many cases, as it only has to look at the primary/unique key (if one exists) to search the data internally.

performance - single join select vs. multiple simple selects

What is better as far as performance goes?
There is only one way to know: Time it.
In general, I think a single join enables the database to do a lot of optimizations, as it can see all the tables it needs to scan, overhead is reduced, and it can build up the result set locally.
Recently, I had about 100 select-statements which I changed into a JOIN in my code. With a few indexes, I was able to go from 1 minute running time to about 0.6 seconds.
Do not try to write your own join loop as a bunch of selects. Your database server has many clever algorithms for doing joins. Further, your database server can use statistics and estimated cost of access to dynamically pick a join algorithm.
The database server's join algorithm is -- usually -- better than anything you might concoct. They know more about physical I/O, caching and what-not.
This allows you to focus on your problem domain.
A single join will usually outperform multiple single selects. However, there are too many different cases that fit your question. It isn't wise to lump them together under a single simple rule.
More important, a single join will usually be easier for the next programmer to understand and to revise, provided that you and the next programmer "speak the same language" when you use SQL. I'm talking about the language of sets of tuples.
And equally important is that database physical design and query design need to focus first on the questions that will result in a ten-for-one speed improvement, not on a 10% speed improvement. If you were doing thousands of simple selects versus a single join, you might get a ten-for-one advantage. If you are doing three or four simple selects, you won't see a big improvement one way or the other.
One thing to consider besides what has been said is that the selects will probably return more data over the network than the join will. If the network connection is already a bottleneck, this could make it much worse, especially if it is done frequently. That said, your best bet in any performance situation is to test, test, test.
It all depends on how the database will optimize the joins, and the use of indexes.
I had a slow and complex query with lots of joins. Then I subdivided it into 2 or 3 less complex queries. The performance gain was astonishing.
But in the end, "it depends"; you have to know where the bottleneck is.
As has been said before, there is no right answer without context.
The answer to this is dependent on (off the top of my head):
the amount of joining
the type of joining
indexing
the amount of re-use you could have for any of the separate pieces to be joined
the amount of data to be processed
the server setup
etc.
If you are using SQL Server (I am not sure if this is available with other RDBMSs) I would suggest that you bundle an execution plan with your query results. This will give you the ability to see exactly how your query (or queries) is being executed and what is causing any bottlenecks.
Until you know what SQL Server is actually doing I wouldn't hazard a guess about which query is better.
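One way to gather the raw numbers alongside the plan (SQL Server syntax; the query itself is just a placeholder):
-- per-statement I/O and CPU figures are reported on the Messages tab
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.CustomerID = 42;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;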
If your database has lots of data and there are multiple joins, then use indexing for better performance.
If there are left/right outer joins in this case, then use multiple selects.
It all depends on your db size, your query, the indexes (which include primary and foreign keys also) ... One cannot reach a yes/no conclusion on your question.

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?
This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables or queries which join many tables in a single query. ... but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so in stead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
Updates. Normalisation makes updates much faster because each update is the smallest it can be (you are updating the smallest possible table, in terms of both columns and rows). This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitioning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema. If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win.
Vertical partitioning. This is chopping fields off a table and putting them in a new table which is joined back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indexes. I'm not sure if your question covers these, but there are plenty of other answers on SO concerning the use of indexes. A good way to find a case for an index is: find a slow query, look at the query plan and find a table scan, then index fields on that table so as to remove the table scan (see the sketch at the end of this answer). I can write more on this if required - leave a comment.
You might also like my post on this.
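To illustrate that last point on indexes with a minimal sketch (the orders table and customer_id column are hypothetical): suppose the plan shows a table scan for a query filtering on customer_id.
-- before: "select ... from orders where customer_id = 42" shows a table scan in the plan
create index ix_orders_customer_id on orders (customer_id);
-- after: the same query should now show an index seek instead of a table scan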
That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.
Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost of one run), examine their execution plans, look at the sizes of the tables involved, and check whether they are predominantly read or write, or a mixture of both.
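A rough sketch of using those DMVs to surface the most costly queries (SQL Server 2005 and later):
-- top queries by total CPU time since the plan cache was last cleared
SELECT TOP (10)
       qs.execution_count,
       qs.total_worker_time   AS total_cpu,
       qs.total_logical_reads,
       st.text                AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;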
If the system is under your full control (apps and DB) you can often re-write queries that are badly formed (quite a common occurrence), such as deep correlated sub-queries, which can often be re-written as derived table joins with a little thought. Otherwise, your options are to create covering non-clustered indexes and ensure that statistics are kept up to date.
For MySQL there is a feature called the slow query log.
The rest is based on what kind of data you have and how it is setup.
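A minimal sketch of turning the slow query log on at runtime (MySQL; the one-second threshold is just an example):
-- log any statement that takes longer than one second
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
-- the destination is controlled by the slow_query_log_file variable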
In SQL Server you can use a trace to find out how your query is performing. Use Ctrl+K or Ctrl+L.
For example, if you see a full table scan happening on a table with a large number of records, then it probably is not a good query.
A more specific question will definitely fetch you better answers.
If your table is predominantly read, place a clustered index on the table.
My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would have said that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMS' may also explain what you should do to improve the performance (add an index on diskusage in this case).
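In that case the fix would be something along these lines (syntax varies slightly by DBMS; x and diskusage are the names from the example above):
create index ix_x_diskusage on x (diskusage);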
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures so that the DBAs control what the queries are - that's their job, keeping the DB running efficiently.
Indexes on PKs and FKs, and one thing that always helps: PARTITIONING...
1. What are the patterns you use to determine the frequent queries?
Depends on what level you are dealing with the database at. If you're a DBA or have access to the tools, DBs like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a DB, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try to get a general feel for how the table is being used and the data it contains. I go about it with the following questions:
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/De-Normalization: either one can improve performance depending on the usage of the table. I almost always normalize, and only de-normalize if I can in no other practical way make the query faster. A nice way to denormalize for queries, when your situation allows it, is to keep the real tables normalized and create a denormalized "table" with a materialized view (see the sketch after this list).
-- Index judiciously. Too many can be bad on many levels. BitMap indexes are great in Oracle as long as you're not updating the column frequently and that column has a low cardinality.
-- Using Index organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
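A minimal Oracle-flavoured sketch of the materialized-view idea mentioned above (the orders/customers tables and the refresh policy are only illustrative):
-- queries can hit the precomputed summary instead of re-joining the base tables
CREATE MATERIALIZED VIEW mv_customer_order_totals
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT c.customer_id,
       c.customer_name,
       COUNT(*)      AS order_count,
       SUM(o.amount) AS total_amount
FROM   customers c
JOIN   orders    o ON o.customer_id = c.customer_id
GROUP  BY c.customer_id, c.customer_name;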
Hope this is helpful for you.