I have a query as given below
select
student_id, student_name, student_total
from
student
where
(student_name like '%a%' and student_total > 400)
or student_rank < 10
How will the SQL engine execute this query? Will the conditions be evaluated from right to left or from left to right?
No.
You don't know, and you must not depend on the answer.
The execution planner is the one that handles preparing the actual step-by-step plan of executing the query. This will tend to change with indices, statistics and such. It might very well evaluate the like first on a table with 100 rows, but the student_rank < 10 first on a table with 10 million rows and an index on student_rank. And if the statistics are right and you have an index on student_total, it might filter based on student_total first, even though it's deep inside the filter expression tree. The answer can also change with new versions of the engine, and possibly even with upgrades and updates to the server (e.g. the amount of memory available, total network and CPU load, ...)
Why do you care? It's the DB engine's problem to solve. And given that you're doing a like '%something%', it will most likely put that as the last condition pretty much always - as long as there's an index it can use for student_rank.
The fact that there's no definite order of execution also has implications that might surprise you. For example, if you have a function that throws an exception / error when it's passed a null value, doing (SomeColumn is not null and MyFunction(SomeColumn)) is not safe - there is no guaranteed short-circuiting, so it can still throw the exception / error for rows with a null value in SomeColumn.
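If you do need to guarantee that the function is only reached for non-null values, the usual workaround is a CASE expression, which most engines document as evaluating its branches in order (with some caveats, e.g. around aggregates). A hedged sketch of the WHERE-clause fragment, reusing the made-up names SomeColumn and MyFunction from above:
-- not safe: no guaranteed short-circuiting of AND
-- WHERE SomeColumn is not null and MyFunction(SomeColumn) = 1
-- usually safer: the CASE branches are evaluated in order,
-- so MyFunction is only reached when SomeColumn is not null
WHERE case
        when SomeColumn is null then 0
        when MyFunction(SomeColumn) = 1 then 1
        else 0
      end = 1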
Only the most primitive (barely-)SQL databases have any notion of a fixed order of execution. The thing you should focus on is making the SQL readable first and foremost. Performance tweaks must always be precisely documented, along with tests to replicate the intended behaviour etc., because they are extremely fragile. Before adding index hints, make sure your indices are properly maintained, with up-to-date statistics, low fragmentation, good coverage etc. etc. When the execution planner produces sub-optimal execution plans, it's almost always your fault (and only very rarely a subtle bug in the engine or a known limitation) - either because you tried performance tricks in the SQL, or because no DBA is taking care of the maintenance.
Related
When I write code I like to make sure I'm optimizing performance. I would assume that this includes ordering the filters to have the heavy reducers (filter out lots of rows) at the top and the lighter reducers (filter out a few rows) at the bottom.
But when I have errors in my filters I have noticed that SQL Server first catches the errors in the filters at the bottom and then catches the errors in the filters at the top. Does this mean that SQL Server processes filters from the bottom up?
For example (for clarity I'm putting the filters - with intentional typos - in the WHERE clause rather than the JOIN clause):
select
l.Loan_Number
,l.Owner_First_Name
,l.Owner_Last_Name
,l.Street
,l.City
,l.State
,p.Balance
,p.Delinquency_Bucket
,p.Next_Due_Date
from
Location l
join Payments p on l.Account_Number = p.Account_Number
where
l.OOOOOwner_Last_Name = 'Kostoryz' -- I assume this would reduce the most, so I put it first
and p.DDDDelinquency = '90+' -- I assume this would reduce second most, so I put it second
and l.SSSState <> 'WY' -- I assume this would reduce the least, so I put it last
Yet the first error SQL Server would return would be ERROR - THERE IS NO COLUMN SSSState IN Location TABLE
The next error it would return would be ERROR - THERE IS NO COLUMN DDDDelinquency IN Payments TABLE
Does this mean that the State filter would be applied before the Delinquency filter and the Delinquency filter would be applied before the Last_Name filter?
There are roughly three stages between the moment a DBMS receives a query in text form and the moment you can fetch its result:
1. The text is transformed into some internal format that the DBMS can work with more easily.
2. From that internal format the DBMS tries to compute an optimal way of actually executing the query; you can think of it as a little program that is developed there.
3. That program is executed and the result is written somewhere (in memory) that you can fetch it from.
(These stages could be divided into even smaller substages, but that level of detail isn't needed here, I guess.)
Now with that in mind, note that the errors you mention are emitted in stage 1, when the DBMS tries to bind the names in the query to actual objects in the database and cannot find them. The query is far from execution at that point, and the order in which that binding happens has nothing to do with the order in which the filters are actually applied later. After that comes stage 2: in order to find an optimal way of execution, the DBMS can and will reorder things (and not only filters). So it usually doesn't matter how you ordered the filters or in what order the binding went. The DBMS will look at them and decide which one is better applied earlier and which one can wait until later.
Keep in mind that SQL is a declarative language. Rather than telling the machine what to do -- as we typically do when writing programs in imperative languages -- we describe the result we want and let the machine figure out how to compute it, ideally in the best possible way or at least a good one.
(Of course, that optimization may not always work 100%. Sometimes there are tricks that help the DBMS find a better plan. But for a query of the kind you posted, any DBMS should cope pretty well and find a good order in which to apply the filters, no matter how you ordered them.)
Before SQL Server attempts to run the query, it creates a Query Execution Plan (QEP). The errors you are seeing are happening while the QEP is being built. You cannot infer any information about the sequence of "filters" based on the order you get these errors.
Once you have provided a valid query, SQL Server will build a QEP, and that plan governs the operations used to satisfy the query. The QEP is based on many factors, including what indexes and statistics are available on the tables - though not usually the order in which you specify conditions in the WHERE clause. There are ways to force an order, but it is usually not recommended.
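If you want to see what the QEP actually does with your predicates, you can ask SQL Server for it directly rather than guessing from error messages. A minimal sketch (the column names are the corrected versions of the ones in the question, so treat them as assumptions):
SET SHOWPLAN_XML ON;   -- return the estimated plan instead of executing the query
GO
select l.Loan_Number
from Location l
join Payments p on l.Account_Number = p.Account_Number
where l.Owner_Last_Name = 'Kostoryz'
  and p.Delinquency_Bucket = '90+'
  and l.State <> 'WY';
GO
SET SHOWPLAN_XML OFF;
GO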
In short, NO. The order of the filters doesn't matter.
At a high level, the query goes through multiple stages before execution. The stages are:
Parsing & Normalization (where the syntax is checked and tables are validated)
Compilation & Optimization (Where the code is compiled and optimized for execution)
In the optimization stage, the table and index statistics are examined to arrive at an optimal execution plan for the query. So the filters are evaluated based on those statistics and applied in the order the optimizer chooses. In other words, the order of filters in the query DOESN'T matter. The column statistics DO matter.
Read more on Stages of query execution
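Since the plan hinges on those statistics, keeping them up to date is usually a better use of your time than reordering predicates. A hedged T-SQL sketch (the table and column names are borrowed from the first question purely as an example):
-- refresh the statistics the optimizer uses to estimate row counts
UPDATE STATISTICS dbo.student WITH FULLSCAN;
-- or create a statistics object on a column that is filtered on a lot
CREATE STATISTICS st_student_total ON dbo.student (student_total);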
I am learning SQL following "SQL in 10 Minutes".
Regarding the use of the wildcard to retrieve all columns, it states:
As a rule, you are better off not using the * wildcard unless you really do need every column in the table. Even though use of wildcards may save you the time and effort needed to list the desired columns explicitly, retrieving unnecessary columns usually slows down the performance of your retrieval and your application.
However, it took less time to retrieve all the columns than to retrieve specific fields:
As the results indicate, the wildcard took 0.02 seconds vs. 0.1 seconds.
I tested several times, and the wildcard was consistently faster than the explicitly specified columns, even though the time consumed varied each run.
Kudos to you for attempting to validate the advice you get in a book! A single test neither validates nor invalidates the advice, so it is worthwhile to dive further.
The advice provided in SQL in 10 Minutes is sound advice -- and it explicitly states that the purpose is related to performance. (Another consideration is that the wildcard makes the code unstable when the database changes.) As a note: I regularly use select t.* for ad-hoc queries.
Why are the results different? There can be multiple reasons for this:
Databases do not have deterministic performance, so other considerations -- such as other processes running on the machine or resource contention -- can affect the performance.
As mentioned in a comment, caching can be the reason. Specifically, running the first query may require loading the data from disk, while the data is already in memory for the second.
Another form of caching is for the execution plan, so perhaps the first execution plan is cached but not the second.
You don't mention the database, but perhaps your database has a really, really slow compiler and compiling the first takes longer than the second.
Fundamentally, the advice is sound from a common-sense perspective. Moving less data around should be more efficient. That is really what the advice is saying.
In any case, the difference between 0.1 seconds and 0.02 seconds is very short. I would not generalize this to larger data and say that one is 5 times faster than the other in general. For whatever reason, it is 80 milliseconds shorter on a very small data set, one so small that performance would not be a consideration anyway.
For manually testing the data that's in a table or tables?
Then it doesn't matter much whether you use a * or the column names.
Sure, if the table has, say, 100 columns and you're only interested in a few, then explicitly listing the column names will give you a less convoluted result.
Plus, you can choose the order in which they appear in the result.
And using a * in a sub-query drags all the fields into the result set,
while selecting only the columns you need could improve performance.
For manual testing, that normally doesn't matter much.
Whether a test SQL runs 1 second or 2 seconds won't bother you if it's just a test or an ad-hoc query.
What the suggestion is really intended for is SQL that will be used in a production environment.
When you use * in a SQL statement, any change to the tables used in the query can change the output of that query.
Possibly leading to errors. Your boss would frown upon that!
For example, a SQL with a select * from tableA union select * from tableB that you coded a year ago suddenly starts crashing because a column was added to tableB. Ouch.
But with the column names listed explicitly, adding a column to one of the tables makes no difference to that SQL.
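A hedged sketch of the difference (tableA, tableB and the column names are made up):
-- fragile: breaks as soon as a column is added to only one of the tables,
-- because the two sides of the union no longer line up
select * from tableA
union
select * from tableB;
-- stable: an extra column in either table has no effect on this query
select id, name, price from tableA
union
select id, name, price from tableB;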
In other words:
in production, stability and performance matter much more than code golf.
Another thing to keep in mind is the effect of caching.
Some databases can temporarily store metadata or even data in memory.
That can speed up a query that returns the same results as a query that ran just before it.
So try running the following SQL's.
Which are in a different order than in the question.
And check if there's still a speed difference.
select * from products;
select prod_id, prod_name, prod_price from products;
I have a lot of records in the table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(@pSProviderID,Stxn.SProviderid)
AND Stxn.status = ISNULL(@pStatus,Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate,getdate()-1) and ISNULL(@pEndDate,getdate())
AND Stxn.CreatedBy = ISNULL(@pSellerId,Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.
This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column = ISNULL(@param, column) syntax. MSSQL won't use an index for that. Consider changing it to (@param IS NULL OR column = @param).
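For example, a hedged sketch of what the rewrite could look like for the query in the question (column list shortened; OPTION (RECOMPILE) is one of the approaches the linked article describes for letting the optimizer build a plan for the actual parameter values):
-- the @p... variables are the report parameters from the question
SELECT Stxn.StxnID, Stxn.Status, Stxn.CreatedOn, Stxn.CreatedBy
FROM Stxn
WHERE (@pSProviderID IS NULL OR Stxn.SProviderID = @pSProviderID)
  AND (@pStatus      IS NULL OR Stxn.Status      = @pStatus)
  AND (@pSellerId    IS NULL OR Stxn.CreatedBy   = @pSellerId)
  AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate, GETDATE() - 1)
                         AND ISNULL(@pEndDate, GETDATE())
ORDER BY Stxn.StxnID DESC
OPTION (RECOMPILE);  -- compile a fresh plan for these parameter values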
You should consider looking at the execution plan and checking for missing indexes. Also, how long does it take to execute? What is slow for you?
Maybe you could also return fewer rows, but that is just a guess. Really, we need to see your tables and indexes plus the execution plan.
Check sql-tuning-tutorial
For one, use SELECT TOP (10) instead of SET ROWCOUNT - the optimizer will have a much better chance that way. Another suggestion is to use a proper inner join instead of the old style table,table join syntax, which makes it much easier to accidentally end up with a cartesian product (that is not the case here, but it happens much more easily with the old syntax). It should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.
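Putting those two suggestions together, a hedged sketch (column list trimmed for brevity):
SELECT TOP (10)            -- replaces SET ROWCOUNT 10
       Stxn.StxnID,
       Sprovider.description AS SProvider,
       Stxn.Status,
       Stxn.CreatedOn
FROM Stxn
INNER JOIN Sprovider
        ON Stxn.SproviderID = Sprovider.Sproviderid
ORDER BY Stxn.StxnID DESC;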
There are a lot of things that could impact the performance of the query, although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk constrained? In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory constrained, then disk speed becomes a larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes then it will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the s'proc. Unless you execute this WITH RECOMPILE so the parameter values can be sniffed, those functions may have to be evaluated for each row.
Network:
Is the network slow between you and the server? Depending on the amount of data pulled, you could be pulling GBs of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1 MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions asked by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.
Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)
I don't see much that can be done on the query itself, but I can see things that could be done on the schema (see the sketch after this list):
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID
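A minimal T-SQL sketch of what that could look like (the index names, and whether to combine columns into fewer composite indexes, are assumptions that depend on the real workload):
CREATE INDEX IX_Stxn_SproviderID ON Stxn (SproviderID);
CREATE INDEX IX_Stxn_CreatedOn   ON Stxn (CreatedOn) INCLUDE (Status, CreatedBy);
CREATE INDEX IX_Sprovider_ID     ON Sprovider (Sproviderid);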
Something to consider: when ROWCOUNT or TOP is used with an ORDER BY clause, the entire result set is created and sorted first, and only then are the top 10 rows returned.
How does this run without the Order By clause?
I have a query that looks something like this:
select xmlelement("rootNode",
(case
when XH.ID is not null then
xmlelement("xhID", XH.ID)
else
xmlelement("xhID", xmlattributes('true' AS "xsi:nil"), XH.ID)
end),
(case
when XH.SER_NUM is not null then
xmlelement("serialNumber", XH.SER_NUM)
else
xmlelement("serialNumber", xmlattributes('true' AS "xsi:nil"), XH.SER_NUM)
end),
/*repeat this pattern for many more columns from the same table...*/
)
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
It's ugly and I don't like it, and it is also the slowest executing query (there are others of similar form, but much smaller and they aren't causing any major problems - yet). Maintenance is relatively easy as this is mostly a generated query, but my concern now is for performance. I am wondering how much of an overhead there is for all of these case expressions.
To see if there was any difference, I wrote another version of this query as:
select xmlelement("rootNode",
xmlforest(XH.ID, XH.SER_NUM,...
(I know that this query does not produce exactly the same thing; my plan was to move the logic for handling the renaming and the xsi:nil attribute into XSL or maybe into PL/SQL.)
I tried to get execution plans for both versions, but they are the same. I'm guessing that the logic does not get factored into the execution plan. My gut tells me the second version should execute faster, but I'd like some way to prove that (other than writing a PL/SQL test function with timing statements before and after the query and running that code over and over again to get a test sample).
Is it possible to get a good idea of how much the case-when will cost?
Also, I could write the case-when logic using the decode function instead. Would that perform better than the case expressions?
Just about anything in your SELECT list - unless it is a user-defined function which reads a table or view, or a nested subselect - can usually be neglected for the purpose of analyzing your query's performance.
Open your connection properties and run SET STATISTICS IO ON. Check how many reads are happening. View the query plan. Are your indexes being used properly? Do you know how to analyze the plan to see?
For the purposes of performance tuning you are dealing with this statement:
SELECT *
FROM XH
WHERE XH.ID = 'SOMETHINGOROTHER'
How does that query perform? If it returns in markedly less time than the XML version then you need to consider the performance of the functions, but I would be astonished if that were the case (oh ho!).
Does this return one row or several? If one row then you have only two things to work with:
is XH.ID indexed and, if so, is the index being used?
does the "many more columns from the same table" indicate a problem with chained rows?
If the query returns several rows then ... Well, actually you have the same two things to work with. It's just the emphasis is different with regards to indexes. If the index has a very poor clustering factor then it could be faster to avoid using the index in favour of a full table scan.
Beyond that you would need to look at physical problems - I/O bottlenecks, poor interconnects, a dodgy disk. The reason why your scope for tuning the query is so restricted is because - as presented - it is a single table, single column read. Most tuning is about efficient joining. Now if XH transpires to be a view over a complex query then it is a different matter.
You can use good old tkprof to analyze statistics. Use one of the many forms of ALTER SESSION that turn on stats gathering. The DBMS_PROFILER package also gathers statistics if your cursor is in a PL/SQL code block.
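A hedged sketch of that route (the trace file and output names below are placeholders; the actual trace file name depends on the instance and process id):
ALTER SESSION SET timed_statistics = TRUE;
ALTER SESSION SET sql_trace = TRUE;
-- run the XML query here, then:
ALTER SESSION SET sql_trace = FALSE;
-- on the database server, format the raw trace with tkprof:
--   tkprof orcl_ora_12345.trc xh_query.prf sys=no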
Are there any good ways to objectively measure a query's performance in Oracle 10g? There's one particular query that I've been tuning for a few days. I've gotten a version that seems to be running faster (at least based on my initial tests), but the EXPLAIN cost is roughly the same.
How likely is it that the EXPLAIN cost is missing something?
Are there any particular situations where the EXPLAIN cost is disproportionately different from the query's actual performance?
I used the first_rows hint on this query. Does this have an impact?
How likely is it that the EXPLAIN cost is missing something?
Very unlikely. In fact, it would be a level 1 bug :)
Actually, if your statistics have changed significantly since you ran the EXPLAIN, the actual query plan will differ. But as soon as the query is compiled, the plan will remain the same.
Note that EXPLAIN PLAN may show you things that are likely to happen but may never happen in an actual query.
Like, if you run an EXPLAIN PLAN on a hierarchical query:
SELECT *
FROM table
START WITH
id = :startid
CONNECT BY
parent = PRIOR id
with indexes on both id and parent, you will see an extra FULL TABLE SCAN which most probably will not happen in real life.
Use STORED OUTLINEs to store and reuse the plan no matter what.
Are there any particular situations where the EXPLAIN cost is disproportionately different from the query's actual performance?
Yes, it happens very, very often on complicated queries.
The CBO (cost based optimizer) uses calculated statistics to estimate query time and choose the optimal plan.
If you have lots of JOINs, subqueries and these kinds of things in your query, its algorithm cannot predict exactly which plan will be faster, especially when you hit memory limits.
Here's the particular situation you asked about: a HASH JOIN, for instance, will need several passes over the probe table if the hash table does not fit into pga_aggregate_target, but as of Oracle 10g, I don't remember this ever being taken into account by the CBO.
That's why I hint every query I expect to run for more than 2 seconds in a worst case.
I used the first_rows hint on this query. Does this have an impact?
This hint makes the optimizer use a plan with a lower response time: it will return the first rows as soon as possible, even if the overall query time is larger.
Practically, it almost always means using NESTED LOOPs instead of HASH JOINs.
NESTED LOOPs have poorer overall performance on large datasets, but they return the first rows faster (since no hash table needs to be built).
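For example, a hedged sketch of the hint syntax (FIRST_ROWS(n) is the newer form of the hint; the table, column and bind variable names are placeholders):
SELECT /*+ FIRST_ROWS(10) */ *
FROM orders
WHERE customer_id = :cust_id
ORDER BY order_date DESC;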
As for the query from your original question, see my answer here.
Q: Are there any good ways to objectively measure a query's performance in Oracle 10g?
Oracle tracing is the best way to measure performance. Execute the query and let Oracle instrument the execution. In the SQLPlus environment, it's very easy to use AUTOTRACE.
http://asktom.oracle.com/tkyte/article1/autotrace.html (article moved)
http://tkyte.blogspot.com/2007/04/when-explanation-doesn-sound-quite.html
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:5671636641855
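A minimal SQL*Plus sketch (AUTOTRACE needs the PLUSTRACE role; the query itself is just a placeholder):
SET TIMING ON
SET AUTOTRACE TRACEONLY    -- show the plan and execution statistics without printing the rows
SELECT COUNT(*) FROM your_table WHERE your_column = 'X';
SET AUTOTRACE OFF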
And enabling Oracle trace in other environments isn't that difficult.
Q: There's one particular query that I've been tuning for a few days. I've gotten a version that seems to be running faster (at least based on my initial tests), but the EXPLAIN cost is roughly the same.
The actual execution of the statement is what needs to be measured. EXPLAIN PLAN does a decent job of predicting the optimizer plan, but it doesn't actually measure the performance.
Q: 1. How likely is it that the EXPLAIN cost is missing something?
Not very likely, but I have seen cases where EXPLAIN PLAN comes up with a different plan than the optimizer.
Q: 2. Are there any particular situations where the EXPLAIN cost is disproportionately different from the query's actual performance?
The short answer is that I've not observed any. But then again, there's not really a direct correlation between the EXPLAIN PLAN cost and the actual observed performance. It's possible for EXPLAIN PLAN to give a really high number for cost, but to have the actual query run in less than a second. EXPLAIN PLAN does not measure the actual performance of the query, for that you need Oracle trace.
Q: 3. I used the first_rows hint on this query. Does this have an impact?
Any hint (like /*+ FIRST_ROWS */) may influence which plan is selected by the optimizer.
The "cost" returned by the EXPLAIN PLAN is relative. It's an indication of performance, but not an accurate gauge of it. You can't translate a cost number into a number of disk operations or a number of CPU seconds or number of wait events.
Normally, we find that a statement with an EXPLAIN PLAN cost shown as 1 is going to run "very quickly", and a statement with an EXPLAIN PLAN cost on the order of five or six digits is going to take more time to run. But not always.
What the optimizer is doing is comparing a lot of possible execution plans (full table scan, using an index, nested loop join, etc.). The optimizer assigns a number to each plan, then selects the plan with the lowest number.
I have seen cases where the optimizer plan shown by EXPLAIN PLAN does NOT match the actual plan used when the statement is executed. I saw that a decade ago with Oracle8, particularly when the statement involved bind variables, rather than literals.
To get an actual cost for statement execution, turn on tracing for your statement.
The easiest way to do this is with SQLPlus AUTOTRACE.
http://asktom.oracle.com/tkyte/article1/autotrace.html
Outside the SQLPlus environment, you can turn on Oracle tracing:
alter session set timed_statistics = true;
alter session set tracefile_identifier = here_is_my_session;
alter session set events '10046 trace name context forever, level 12';
--alter session set events '10053 trace name context forever, level 1';
select /*-- your_statement_here --*/ ...
alter session set events '10046 trace name context off';
--alter session set events '10053 trace name context off';
This puts a trace file into the user_dump_dest directory on the server. The tracefile produced will have the statement plan AND all of the wait events. (The assigned tracefile identifier is included in the filename, and makes it easier to find your file in the udump directory)
select value from v$parameter where name like 'user_dump_dest'
If you don't have access to the tracefile, you're going to need to get help from the DBA to get access. (The DBA can create a simple shell script that developers can run against a .trc file to run tkprof, and change the permissions on the trace file and on the tkprof output.) You can also use the newer trcanlzr. There are Oracle Metalink notes on both.
AFAIK, EXPLAIN uses database statistics to calculate the cost, so it can definitely differ from the actual performance.
In my experience EXPLAIN has been accurate and beneficial. If it weren't, it might not be the useful tool it is. When was the last time you analyzed the tables? I have seen cases where the explain plan was nearly the same before and after an analyze, but the analyze made a huge performance gain.