SQL query processing order: Group By first or Join First? - sql

I need to execute an SQL query.
If i have a query with multiple tables in From clause with Join condition in Where clause,
And i have Group by statement,
Should i perform Join operation first followed by Group By ?
OR should i perform Group By first then Join ?
Which one would be better ?
Note: In my environment, whichever operator that filters out more tuples should be executed first for better performance and less usage of memory for overall query execution.

Use DB's EXPLAIN syntax, You'll see witch of this two methods (in Yours specific environment) will cause more DB operations taken to produce output

Related

SQL "Order of execution" vs "Order of writing"

I am a new learner of SQL language to add knowledge to my career, I came to learn that in writing a query, there is a "Order of writing" vs "Order of execution", however I can't seem to find a full list of available SQL functions listing out the hierarchy
So far from what I learn I got this table, can someone with better knowledge help confirm if my table below is correct? And perhaps add any other functions that I might have missed, I am not sure where I should put the JOIN in the table below
Also, is there a difference (either in order or name of function) if I am using different Sql platforms?
MySql vs BigQuery for eg.
Your help is deeply appreciated, big thanks in advance for reading this post by a beginner
Order of writing
Order of execution
Select
From
Top
Where
Distinct
Group by
From
Having
Where
Select
Group by
Window
Having
QUALIFY
Order by
Distinct
Second
Order by
QUALIFY
Top
Limit
Limit
SQL is a declarative language, not a procedural language. That means that the SQL compiler and optimizer determine what operations are actually run. These operations typically take the form of a directed acyclic graph (DAG) of operations.
The operators have no obvious relationship to the original query -- except that the results it generates are guaranteed to be the same. In terms of execution there are no clauses, just things like "hash join" and "filter" and "sort" -- or whatever the database implements for the DAG.
You are confusing execution with compilation and probably you just care about scoping rules.
So, to start with SQL has a set of clauses and these are in a very specified order. Your question contains this ordering -- at least for a database that supports those clauses.
The second part is the ordering for identifying identifiers. Basically, this comes down to:
Table aliases are defined in the FROM clause. So this can be considered as "first" for scoping purposes.
Column aliases are defined in the SELECT clause. By the SQL Standard, column aliases can be used in the ORDER BY. Many databases extend this to the QUALIFY (if supported), HAVING, and GROUP BY clauses. In general, databases do not support them in the WHERE clause.
If two tables in the FROM have the same column name, then the column has to be qualified to identify the table. The one exception to this is when the column is a key in a JOIN and the USING clause is used. Then the unqualified column name is fine.
If a column alias defined in the SELECT conflicts with a table alias in a clause that supports column aliases, then it is up to the database which to choose.
The whole point of SQL is that it is a 'whole set' language and there is no particular set order to much of it. Today's DBMS evaluates each Select query as a whole to determine the best, most efficient way to assemble the data set results, in much the same way that Google Maps might determine the best path to get you home based both on where you are and ambient traffic.
Databases will provide, under their Explain Plan command, exactly the sequence they will use to process your query. This called the Execution Plan. Each of these steps are performed on entire table sets and where possible under parallel processes. The steps in each plan do not have any of your names listed above, instead a step might say "perform an index scan on table A", or "perform a nested loops join on the prior partial result set and table B". In some cases they will filter records before joining and in other cases they won't, for example.
Within those parameters there are some tasks that always come before others. For example, all Where clause filtering takes place before aggregation and summary filtering (Having clause). But there are few absolute rules here.
When writing SQL, I found that the execution order of the select statement is not the same as the order of writing.
The order in which SQL query statements are written is
SELECT
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
But in fact the order of execution of the SQL statement is
FROM
WHERE
GROUP BY
HAVING
SELECT
UNION
ORDER BY
SQL will first choose where my table is selected, including the table's restrictions, (such as connection mode JOIN and restrictions ON)
SQL will choose what my judgment condition is, that is, the problem of WHERE
Then it will group by grouping and execute the HAVING statement.
SELECT statement is executed after most of the statements are executed, so we must understand that the statement executed in front of it will affect it, and pay attention to the actual work. This is especially important.
With the execution order of the statement we can find that order by the last execution, so we can sort the new fields named in select.

Is possible to cache in some variable results of subquery?

I have query in postgresql 9.1 like
SELECT id
FROM students
INNER JOIN exams ON /some condition
WHERE studentsid NOT IN (SUBQUERY);
and when I run only the subquery it executes in 120ms, when I execute the previous query without condition with subquery it executes for 12 seconds, but when I add subquery it runs half hour
Is possible to cache in some variable results of subquery (results are always same array of ids) and execute in console/pgadmin ?
I found WITH statement but it looks like is not supported in postgres
First, the with statement is supported in Postgres.
Second, you need to identify where the performance problem is. Is it in the subquery? Or is it the not in?
You can put the subquery in a table, add indexes, and make the query more efficient.
You can rewrite the subquery using a left join, which often allows the query to be better optimized.
You can add appropriate indexes to make the entire query more efficient.
Without knowledge of what the subquery actually does, the right approach is speculation.

SQL Server 2005 - Order of Inner Joins

I have a query containing three inner join statements in the Where clause. The query takes roughly 2 minutes to execute. If I simply change the order of two of the inner joins, performance drops to 40 seconds.
How can doing nothing but changing the order of the inner joins have such a drastic impact of query performance? I would have thought the optimizer would figure all this out.
SQL is declarative, that is, the JOIN order should not matter.
However it can in practice, say, if it's a complex query when the optimiser does not explore all options (which in theory could take months).
Another option is that it's a very different query if you reorder and you get different results, but this is usually with OUTER JOINs.
And it could also be the way the ON clause is specified It has to change if you reorder the FROM clause. Unless you are using the older (and bad) JOIN-in-the-WHERE-clause.
Finally, if it's a concern you could use parenthesis to change evaluation order to make your intentions clear, say, filter on a large table first to generate a derived table.
Because by changing the order of the joins, SQL Server is coming up with a different execution plan for your query (chances are it's changing the way it's filtering the tables based on your joins).
In this case, I'm guessing you have several large tables...one of which performs the majority of the filtering.
In one query, your joins are joining several of the large tables together and then filtering the records at the end.
In the other, you are filtering the first table down to a much smaller sub-set of the data...and then joining the rest of the tables in. Since that initial table got filtered before joining the other large recordsets, performance is much better.
You could always verify but running the query with the 'Show query plan' option enabled and see what the query plan is for the two different join orders.
I would have thought it was smart enough to do that as well, but clearly it's still performing the joins in the order you explicitly list them... As to why that affects the performance, if the first join produces an intermediate result set of only 100 records in one ordering scheme, then the second join will be from that 100-record set to the third table.
If putting the other join first produces a first intermediate result set of one million records, then the second join will be from a one million row result set to the third table...

Are LEFT JOIN subquery table arguments evaluated more than once?

I have a query that looks like this:
SELECT *
FROM employees e
LEFT JOIN
(
SELECT *
FROM timereports
WHERE date = '2009-05-04'
) t
ON e.id = t.employee_id
As you can see, my LEFT JOIN second table parameter is generated by a a subquery.
Does the db evaluate this subquery only once, or multiple times?
thanks.
matti
This depends on the RDBMS.
In most of them, a HASH OUTER JOIN will be employed, in which case the subquery will be evaluated once.
MySQL, on the other hand, isn't capable of making HASH JOIN's, that's why it will most probably push the predicate into the subquery and will issue this query:
SELECT *
FROM timereports t
WHERE t.employee_id = e.id
AND date = '2009-05-04'
in a nested loop. If you have an index on timereports (employee_id, date), this will also be efficient.
If you are using SQL Server, you can take a look at the Execution Plan of the query. The SQL Server query optimizer will optimize the query so that it takes the least time in execution. Best time will be based on some conditions viz. indexing and the like.
You have to ask the database to show the plan. The algorithm for doing this is chosen dynamically (at query time) based on many factors. Some databases use statistics of the key distribution to decide which algorithm to use. Other databases have relatively fixed rules.
Further, each database has a menu of different algorithms. The database could use a sort-merge algorithm, or nested loops. In this case, there may be a query flattening strategy.
You need to use your database's unique "Explain Plan" feature to look at the query execution plan.
You also need to know if your database uses hints (usually comments embedded in the SQL) to pick an algorithm.
You also need to know if your database uses statistics (sometimes called a "cost-based query optimizer) to pick an algorithm.
One you know all that, you'll know how your query is executed and if an inner query is evaluated multiple times or flattened into the parent query or evaluated once to create a temporary result that's used by the parent query.
What do you mean by evaluated?
The database has a couple of different options how to perform a join, the two most common ones being
Nested loops, in which case each row in one table will be looped through and the corresponding row in the other table will be looked up, and
Hash join, which means that both tables will be scanned once and the results are then merged using some hash algorithm.
Which of those two options is chosen depends on the database, the size of the table and the available indexes (and perhaps other things as well).

Subqueries vs joins

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster. (~50 seconds to ~0.3) I expected an improvement, but can anyone explain why it was so drastic? The columns used in the where clause were all indexed. Does SQL execute the query in the where clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query -
2 DEPENDENT SUBQUERY submission_tags ref st_tag_id st_tag_id 4 const 2966 Using where
vs 1 indexed row with the join:
SIMPLE s eq_ref PRIMARY PRIMARY 4 newsladder_production.st.submission_id 1 Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subqueries into joins.
before the queries are run against the dataset they are put through a query optimizer, the optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones) the tuples can't be pruned out of the result set until the outer query starts to run.
With out seeing the the query its hard to say what was so bad about the original, but my guess would be it was something that the optimizer just couldn't make much better. Running 'explain' will show you the optimizers method for retrieving the data.
Look at the query plan for each query.
Where in and Join can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
Optimizer didn't do a very good job. Usually they can be transformed without any difference and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually its the result of the optimizer not being able to figure out that the subquery can be executed as a join in which case it executes the subquery for each record in the table rather then join the table in the subquery against the table you are querying. Some of the more "enterprisey" database are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, not using the index and returning way too many rows that the Where from the main query were needing to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.