Please explain.
a) "subquery factoring" is used to replace a non-correlated subquery. What about correlated subquery? Is there any way to move a correlated sub-query to 'WITH' clause section?
b) are "subquery" "subquery factoring" executed exactly once?
c) "subquery" vs "subquery factoring" which one is better
Thank you.
You can use subquery factoring to replace a non-correlated subquery.
How on Earth do you propose doing so for a correlated subquery?
I don't understand part (b), can you rephrase?
Taking a guess at what you mean: a subquery in the WITH clause is typically executed only once before the main query is executed.
For large datasets, subquery factoring is obviously better since you're executing the subquery only once in most if not all cases. For smaller datasets the overhead of creating temporary tables may take longer than the actual query.
Apart from the performance concerns mentioned above, subquery factoring results in much cleaner and easier-to-maintain code.
By the term "subquery factoring" do you really mean refactoring using a subquery? Refactoring is the process of altering a routine to improve maintenance and readability without altering its result. There are times when one cannot refactor a subquery into a common table expression (into "WITH" clause). Further, there is no golden rule about always using a CTE or always using a subquery (or derived table). It depends on the data and the DBMS as to what approach will perform best.
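To make the distinction concrete, here is a minimal sketch of factoring a non-correlated inline view into the WITH clause (the `employees` table and its columns are invented for illustration):

```sql
-- Inline-view (subquery) form:
SELECT e.dept_id, e.salary
FROM   employees e
JOIN   (SELECT dept_id, AVG(salary) AS avg_sal
        FROM   employees
        GROUP BY dept_id) d
  ON   d.dept_id = e.dept_id
WHERE  e.salary > d.avg_sal;

-- Equivalent factored (WITH clause / CTE) form:
WITH dept_avg AS (
  SELECT dept_id, AVG(salary) AS avg_sal
  FROM   employees
  GROUP BY dept_id
)
SELECT e.dept_id, e.salary
FROM   employees e
JOIN   dept_avg d ON d.dept_id = e.dept_id
WHERE  e.salary > d.avg_sal;
```

The two forms state the same query; the factored version simply names the derived result so it can be read top-down and reused.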
Related
I am having trouble converting a query from Postgres to BigQuery.
How could this query work on BigQuery? I need a subselect on a table that brings back the latest record:
(select g.fase__c
from operacoes.salesforce_gest_o_do_canal_c_full g
where g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
order by g.data_do_credenciamento__c limit 1) as fase_jornada
from operacoes.salesforce_account_full a
If I try to execute it, BigQuery returns an error saying I should apply a join. If I apply the join, the ORDER BY doesn't work.
Thanks!
A correlated subquery (which is a subquery that references a table of an outer query), if executed "as is", would need to compute the subquery for each row of the results of the outer query, which would not be very efficient.
In order to optimize this, BigQuery first transforms (decorrelates) the query into one that is functionally equivalent to the correlated one but without any correlations between queries and subqueries. Different decorrelation processes are needed for different use cases and they can be quite complex and difficult to implement and properly test.
BigQuery is implementing decorrelation strategies to make a variety of correlated subqueries possible, however decorrelation for the LIMIT clause has not been implemented.
A workaround would be to use ARRAY_AGG instead of a subquery. In your case, I believe the following query would do the job:
SELECT
ARRAY_AGG(g.fase__c
ORDER BY g.data_do_credenciamento__c
LIMIT 1) AS fase_jornada
FROM
operacoes.salesforce_account_full a
JOIN
operacoes.salesforce_gest_o_do_canal_c_full g
ON
g.n_de_indentifica_o_do_parceiro__c = a.n_de_indentifica_o_do_parceiro__c
GROUP BY
g.n_de_indentifica_o_do_parceiro__c
Take into account that I have guessed some details since the whole context for the subquery was not provided, so you may need to make some changes to it.
Please let me know if anything was not clear!
I've seen many examples of SQL with complex nested subqueries (and subsubqueries and subsubsubqueries and...). Is there ever any reason to write complex queries like this instead of using WITH and CTEs, just as one would use variables in a programming language to break up complex expressions?
In particular, is there a performance advantage?
Any query that you can write using only subqueries in the FROM clause and regular joins can use CTEs with direct substitution.
Subqueries are needed for:
Correlated subqueries (which are generally not in the FROM clause).
Lateral joins (in databases that support LATERAL or APPLY keywords in the FROM clause).
Scalar subqueries.
Sometimes, a query could be rewritten to avoid these constructs.
Subqueries in the FROM clause (except for lateral joins) can be written using CTEs.
Why are subqueries used and not CTEs? The most important reason is that CTEs are a later addition to the SQL language. With the exception of recursive CTEs, they are not really needed. They are really handy when a subquery is being referenced more than one time, but one could argue that a view serves the same purpose.
As mentioned in the comments, CTEs and subqueries might be optimized differently. This could be a performance gain or loss, depending on the query and the underlying indexes and so on.
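As a sketch of the "direct substitution" described above: a correlated scalar subquery can often be decorrelated into an aggregating CTE joined back in (the `orders`/`payments` names are invented for illustration):

```sql
-- Correlated scalar subquery (re-evaluated per outer row, conceptually):
SELECT o.id,
       (SELECT MAX(p.paid_at)
        FROM   payments p
        WHERE  p.order_id = o.id) AS last_payment
FROM   orders o;

-- Decorrelated CTE version (aggregate once, then join):
WITH last_pay AS (
  SELECT order_id, MAX(paid_at) AS last_payment
  FROM   payments
  GROUP BY order_id
)
SELECT o.id, lp.last_payment
FROM   orders o
LEFT JOIN last_pay lp ON lp.order_id = o.id;
```

The LEFT JOIN preserves orders with no payments, matching the NULL the scalar subquery would return for them.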
Unless your query plan tells you that the subquery performs better than the CTE, I would use a CTE instead of a subquery.
In particular, is there a performance advantage?
Comparing the subquery and the simple (non-recursive) CTE versions, they are probably very similar. You would have to use the profiler and the actual execution plan to spot any differences.
There are some reasons I would use a CTE:
In general, a CTE can be recursive while a subquery cannot, which helps when building a calendar table and is especially well suited to tree structures.
A CTE is easier to maintain and read (as #Lukasz Szozda commented), because you can break complex queries into several CTEs and give them good names, which is very comfortable when writing the main query.
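A minimal recursive-CTE sketch of the calendar-table idea (standard syntax; dialects vary — e.g. SQL Server omits the RECURSIVE keyword and uses DATEADD):

```sql
-- Generate one row per day of January 2024:
WITH RECURSIVE calendar (dt) AS (
  SELECT DATE '2024-01-01'          -- anchor row
  UNION ALL
  SELECT dt + INTERVAL '1' DAY      -- recursive step: next day
  FROM   calendar
  WHERE  dt < DATE '2024-01-31'     -- termination condition
)
SELECT dt FROM calendar;
```

No subquery can express this, because each recursive step refers back to the CTE's own result.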
Without performance considerations:
CTEs are more readable as sql code, meaning easier to maintain and debug.
Subqueries (in the FROM clause) are fine as long as they are few, small and simple; converting those to CTEs would actually make the query harder to read.
There is also the option of views which mostly prevents sql code duplication.
With performance considerations:
CTEs may screw up the more complex they become. If so, they become too risky to be trusted with tweaks and changes, and may push you toward a more aggressive approach like converting all CTEs to temp tables (#).
Subqueries behave about as well as views and a little better than CTEs in most cases. Still, becoming too complex may hinder performance and make optimization difficult. Eventually someone may need to tweak them, or even extract the heaviest ones out to temp tables to lighten the main select.
Views are slightly better as complexity increases, as long as they are composed of plain tables and simple views, their SQL stays clean, and filters are pushed into the views' joins wherever possible. Still, joining two complex views will get you into the same situation as complex CTEs or subqueries.
Does using more columns within a CTE query affect performance? I am currently trying to execute a query with the WITH clause, and it seems that if I use more columns, it takes more time to load the data. Am I correct?
The number of columns defined in a CTE should have no effect on the actual performance of the query (it might affect the compile time, which is generally minuscule).
Why? Because SQL Server "embeds" the code for the CTE in the query itself and then optimizes all the code together. Unused columns should be eliminated.
This might be an over generalization. There might be some cases where SQL Server doesn't eliminate the work for columns -- such as extra aggregation functions in an aggregation query or certain subqueries. But, in general, what is important is how the CTE is used, not how many columns are defined in it.
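The pruning described above can be sketched like this (table and column names are made up; the exact behavior depends on the optimizer):

```sql
-- The CTE defines three columns, but the outer query references only one.
-- After inlining, the optimizer should eliminate col_b and col_c from
-- the plan, so the extra definitions cost nothing at run time:
WITH wide AS (
  SELECT col_a, col_b, col_c
  FROM   some_table
)
SELECT col_a
FROM   wide;
```

Checking the execution plan for which columns are actually read is the reliable way to confirm this on your engine.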
You can think of a CTE as a view that is not materialized to disk. A view expands its definition at run time; the same goes for a CTE.
Are there rules of thumb for developers when to use join instead of subquery or are they the same.
The first principle is "State the query accurately". The second principle is "state the query simply and obviously" (which is where you usually make choices). The third is "state the query so it will process efficiently".
If it's a DBMS with a good query processor, equivalent query designs should result in query plans that are the same (or at least equally efficient).
My greatest frustration upon using MySQL for the first time was how conscious I had to be to anticipate the optimizer. After long experience with Oracle, SQL Server, Informix, and other dbms products, I very seldom expected to concern myself with such issues. It's better now with newer versions of MySQL, but it's still something I end up needing to pay attention to more often than with the others.
Performance-wise, they don't have any difference in most modern DB engines.
The problem with subqueries is that you might end up with a sub-resultset without any key, so joining to it would be more expensive.
If possible, always try to make JOIN queries and filter with ON clause, instead of WHERE (although it should be the same, as modern engines are optimized for this).
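One place where the ON-vs-WHERE distinction genuinely matters is outer joins; a sketch with invented `customers`/`orders` tables:

```sql
-- Filter inside ON: customers with no paid orders are kept (o.* is NULL):
SELECT c.id, o.total
FROM   customers c
LEFT JOIN orders o
  ON   o.customer_id = c.id
 AND   o.status = 'PAID';

-- Filter in WHERE instead:
--   WHERE o.status = 'PAID'
-- would discard the NULL-extended rows and effectively turn the
-- LEFT JOIN into an inner join.
```

For inner joins the two placements are equivalent and the optimizer treats them the same.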
It depends on the RDBMS. You should compare execution plans for both queries.
In my experience with Oracle 10 and 11, execution plans are always the same.
Theoretically every subquery can be changed to a join query.
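A minimal sketch of such a transformation, using invented tables `t1` and `t2`:

```sql
-- IN subquery:
SELECT t1.*
FROM   t1
WHERE  t1.id IN (SELECT t2.t1_id FROM t2);

-- Equivalent join; DISTINCT guards against a t1 row being duplicated
-- when it matches several t2 rows (the IN form returns it only once):
SELECT DISTINCT t1.*
FROM   t1
JOIN   t2 ON t2.t1_id = t1.id;
```

Good optimizers recognize the IN form as a semi-join and produce the same plan for both; weaker ones may only handle the explicit join well.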
As with many things, it depends.
- how complex is the subquery
- in a query how often is the subquery executed
I try to avoid subqueries whenever I can. Especially when expecting large result sets, never use subqueries, in case the subquery is executed for each row of the result set.
take care,
Alex
Let's ignore the performance impact for now (as we should if we are aware that "Premature optimization is the root of all evil").
Choose what looks clearer and easier to maintain.
In SQL Server, a correlated subquery usually performs worse than a join or, often even better for performance, a join to a derived table. I almost never write a subquery for anything that will have to be performed multiple times. This is because correlated subqueries often effectively turn your query into a cursor and run one row at a time. In databases it is usually better to do things in a set-based fashion.
I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster. (~50 seconds to ~0.3) I expected an improvement, but can anyone explain why it was so drastic? The columns used in the where clause were all indexed. Does SQL execute the query in the where clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query:
2 | DEPENDENT SUBQUERY | submission_tags | ref | st_tag_id | st_tag_id | 4 | const | 2966 | Using where
vs. 1 indexed row with the join:
SIMPLE | s | eq_ref | PRIMARY | PRIMARY | 4 | newsladder_production.st.submission_id | 1 | Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subquery into a join.
Before the queries are run against the dataset, they are put through a query optimizer. The optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often when you use subqueries (especially bad ones), the tuples can't be pruned out of the result set until the outer query starts to run.
Without seeing the query it's hard to say what was so bad about the original, but my guess would be it was something the optimizer just couldn't improve much. Running EXPLAIN will show you the optimizer's method for retrieving the data.
Look at the query plan for each query.
Where in and Join can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
The optimizer didn't do a very good job. Usually the two forms can be transformed into one another without any difference, and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The where subquery has to run 1 query for each returned row. The inner join just has to run 1 query.
Usually it's the result of the optimizer not being able to figure out that the subquery can be executed as a join, in which case it executes the subquery for each record in the table rather than joining the table in the subquery against the table you are querying. Some of the more "enterprisey" databases are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, not using the index and returning far too many rows, which the WHERE clause of the main query then had to filter out.
Just a guess without details of course but that's the common situation.
Taken from the Reference Manual (14.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINS.
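The rewrite the manual is describing looks like this (invented tables `t1`/`t2`; note that the two forms only agree when `t2.t1_id` is not nullable, since NOT IN behaves differently in the presence of NULLs):

```sql
-- NOT IN subquery:
SELECT t1.*
FROM   t1
WHERE  t1.id NOT IN (SELECT t2.t1_id FROM t2);

-- Equivalent LEFT [OUTER] JOIN anti-join pattern:
SELECT t1.*
FROM   t1
LEFT JOIN t2 ON t2.t1_id = t1.id
WHERE  t2.t1_id IS NULL;     -- keep only t1 rows with no t2 match
```

The join form gives the server a plan it can drive from an index on `t2.t1_id`, which is where the speedup comes from.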