Suppose I have a SQL Query that isn't really "nice".
The SQL Query has a lot inner join with long conditions, and finally a simple "where" statement.
Now suppose I make a View that corresponds to that SQL Query, excepting the final "where" statement. So that instead of make that long query, I use the view and simply add the "where" statement.
My question is the following:
Which one would have best performance?
I'm not sure if the "View" option corresonds to two sql statement, instead of the one sql statement of the first option. Or before executing the query, the "View" query is "pasted" in the more simple query and is "the same".
No difference. The view uses the same query as the query alone.
To be different it would need to store it in a view table view don't, they're just queries made to look easy from the outside.
Assuming the DBMS you're using has a cost-based query optimizer, you'll probably not get any better performance when a portion of the query is converted to a view. You can confirm this with your database's query explain tool by comparing the estimated cost of both versions of the query.
I recommend using the query explain utility to determine exactly where the query becomes expensive, since the location of a query's pain points are often unexpected. It may seem that all those joins are slowing things down, but they may be exploiting indexes that minimize the I/O and CPU cycles required to connect the rows. Many times the query bogs down because an unindexed column is referenced in a WHERE or ON clause.
Related
I have a query that has joins between 15 tables or even more. I need to optimize the response time. I created some index columns, changed some conditions from NOT IN to NOT EXISTS, but I found myself wondering about this.
Does the order of these joins affect the response time?
The order of JOINs definitely does affect performance, as well as the type of join. INNER JOINs, generally, will yield quicker results than RIGHT or LEFT OUTER JOINs due to the selectivity of the join.
SQL Server also tries its best to optimize every query, but at 15 joins it may have a hard time. Consider breaking the statement up into smaller junks of fewer joins. A strategy to resolve things like this that I've used in the past is to create a temporary table to store the results, then INSERT into it and UPDATE it accordingly through several different statements, with the 15 tables being spread out into the appropriate insert/update spots.
The order of the joins definitely matters. The question is, can you change the order by rewriting the SQL? The query optimizer doesn't necessarily care what order you write the joins in, depending on the type of join. The query optimizer does its best to find the most efficient execution plan based on the SQL you've written. However, it is no where close to perfect. If you notice after looking at the execution plan that it could be done more efficiently another way you can trick it into doing it your way and see if it helps.
One way to trick it is to use temp tables to pare down the result set before joining to large tables. This will allow less records to be selected which will reduce I/O.
Another way, as demonstrated by Adam Machanic, is to use a top in the select clause with an order by.
I have a project where all SQL queries was already written. Most of them can be improved, and I want to do that.
But how I can check and analyze the both queries [pre-written and I write same queries using different way].
I want to analyze the queries if I rewrite them then I can check if I did right or if old query is better than mine.
I need a queries analyzer tool for Mysql based RDBMS.
[if any mistake goes then edit it]
MySQL has an EXPLAIN statement, which tells you what execution path the engine will follow. You can use that to determine what changes to make:
EXPLAIN SELECT foo,bar from glurch WHERE baz > 1 ORDER BY foo;
See: http://dev.mysql.com/doc/refman/5.0/en/explain-output.html
Pay special attention to the rows column, which shows how many table rows need to be examined to execute that part of the query, as well as the Extra column, which show more importantly, how the query gets executed.
For example, if you see "Using temporary" in the Extra column, that usually means that the DB will need to write the query results out to a temporary table (possibly on disk), sort them, and then re-read them (or at least some of them).
You can avoid temporary tables and other nasty performance killers by providing appropriate indexes. And not just enough indexes, but rather the correct indexes. Just because you've indexed a column doesn't mean that index can be used in your query.
When you have a good RDMBS engine, the query parser should determine the most performant execution plan for your query, no matter how it is written.
I mean: the order of the tables in your from and join clauses, or the order of your filter-criteria should not matter.
The first thing to do, is look at the execution plan and analyze it. Check wether you've to add indexes, or modify indexes for instance.
Your question is also a bit vague. Can you give a sample of a query which you'd like to refactor ?
let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important, only the presence of a corresponding row is important). From what I know about DB2, this kinda query is better performing when written as a correlated query with a EXISTS clause rather than a INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test on your own environment; With SQL server Management Studio this is easy (or SQL Query Analyzer if your running 2000). Just type both statements into a query window, select Query|Include Actual Query Plan. Then run the query. Go to the results tab and you can easily see what the plans are and which one had a higher cost.
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all I have 26K rep here and one of only 2 current sql topic-specific badges, I'm actually pretty junior in terms of sql knowledge (It's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge it's actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, all things being equal and database performance being very dependent on the database design. I would try out all possible methods and see which are faster in your circumstances.
say i have a table
Id int
Region int
Name nvarchar
select * from table1 where region = 1 and name = 'test'
select * from table1 where name = 'test' and region = 1
will there be a difference in performance?
assume no indexes
is it the same with LINQ?
Because your qualifiers are, in essence, actually the same (it doesn't matter what order the where clauses are put in), then no, there's no difference between those.
As for LINQ, you will need to know what query LINQ to SQL actually emits (you can use a SQL Profiler to find out). Sometimes the query will be the simplest query you can think of, sometimes it will be a convoluted variety of such without you realizing it, because of things like dependencies on FKs or other such constraints. LINQ also wouldn't use an * for select.
The only real way to know is to find out the SQL Server Query Execution plan of both queries. To read more on the topic, go here:
SQL Server Query Execution Plan Analysis
Should it? No. SQL is a relational algebra and the DBMS should optimize irrespective of order within the statement.
Does it? Possibly. Some DBMS' may store data in a certain order (e.g., maintain a key of some sort) despite what they've been told. But, and here's the crux: you cannot rely on it.
You may need to switch DBMS' at some point in the future. Even a later version of the same DBMS may change its behavior. The only thing you should be relying on is what's in the SQL standard.
Regarding the query given: with no indexes or primary key on the two fields in question, you should assume that you'll need a full table scan for both cases. Hence they should run at the same speed.
I don't recommend the *, because the engine should look for the table scheme before executing the query. Instead use the table fields you want to avoid unnecessary overhead.
And yes, the engine optimizes your queries, but help him :) with that.
Best Regards!
For simple queries, likely there is little or no difference, but yes indeed the way you write a query can have a huge impact on performance.
In SQL Server (performance issues are very database specific), a correlated subquery will usually have poor performance compared to doing the same thing in a join to a derived table.
Other things in a query that can affect performance include using SARGable1 where clauses instead of non-SARGable ones, selecting only the fields you need and never using select * (especially not when doing a join as at least one field is repeated), using a set-bases query instead of a cursor, avoiding using a wildcard as the first character in a a like clause and on and on. There are very large books that devote chapters to more efficient ways to write queries.
1 "SARGable", for those that don't know, are stage 1 predicates in DB2 parlance (and possibly other DBMS'). Stage 1 predicates are more efficient since they're parts of indexes and DB2 uses those first.
I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster. (~50 seconds to ~0.3) I expected an improvement, but can anyone explain why it was so drastic? The columns used in the where clause were all indexed. Does SQL execute the query in the where clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subqueries into joins.
before the queries are run against the dataset they are put through a query optimizer, the optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones) the tuples can't be pruned out of the result set until the outer query starts to run.
With out seeing the the query its hard to say what was so bad about the original, but my guess would be it was something that the optimizer just couldn't make much better. Running 'explain' will show you the optimizers method for retrieving the data.
Look at the query plan for each query.
Where in and Join can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
Optimizer didn't do a very good job. Usually they can be transformed without any difference and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually its the result of the optimizer not being able to figure out that the subquery can be executed as a join in which case it executes the subquery for each record in the table rather then join the table in the subquery against the table you are querying. Some of the more "enterprisey" database are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, not using the index and returning way too many rows that the Where from the main query were needing to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.