In which sequence are queries and sub-queries executed by the SQL engine? - sql

Hello, I took an SQL test and am curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The possible answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> prime query
3. the whole query is interpreted at one time
4. There is no fixed sequence of interpretation, the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable with respect to the others).
Now the curiosity:
Where can I read about this and, briefly, what is the mechanism behind all of that?
Thank you.

I think answer 4 is correct. There are a few considerations:
The type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
Here the subquery is correlated: there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
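To make the two cases concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and SQLite stands in for whatever RDBMS you use, so the printed plans show only how this one engine chose to run the queries.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, type TEXT);
    CREATE TABLE t2 (id INTEGER, type TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'a');
    INSERT INTO t2 VALUES (1, 'a'), (2, 'a'), (3, 'b');
""")

uncorrelated_sql = """
    SELECT id FROM t1
    WHERE id IN (SELECT id FROM t2)
    ORDER BY id
"""
correlated_sql = """
    SELECT id FROM t1
    WHERE id IN (SELECT id FROM t2 WHERE t2.type = t1.type)
    ORDER BY id
"""

# The subquery result does not depend on the outer row.
uncorrelated = conn.execute(uncorrelated_sql).fetchall()
# The subquery result depends on t1.type of the current outer row.
correlated = conn.execute(correlated_sql).fetchall()

print(uncorrelated)
print(correlated)

# Ask the engine how it decided to run each query (output is engine-specific).
for sql in (uncorrelated_sql, correlated_sql):
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(row[-1])
```

The plan text differs between SQLite versions, which is itself a small illustration that the execution order is the engine's decision and not something the query fixes.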

Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly" either: the optimiser normally settles on an execution plan before the query starts running.
Even with identical servers, schema, queries, and data I've seen execution plans differ.

The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
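As a rough illustration of the optimizer reacting to what it knows about the tables, the sketch below runs the same query text before and after an index is created. It uses SQLite via Python's sqlite3 module; the orders table, its columns and the index name are hypothetical, and other engines will report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, filled with 1000 rows spread over 100 customers.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)


def plan(sql):
    """Return the engine's plan for sql as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


query = "SELECT total FROM orders WHERE customer = 42"

plan_before = plan(query)  # no index available: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = plan(query)   # same query text, but now the index can be used

print(plan_before)
print(plan_after)
```

The query did not change between the two calls; only the information available to the optimizer did.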

If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.

It usually depends on your DBMS, but I think the second answer is more plausible.
The primary query usually can't be calculated without the subquery results.

Related

Is there performance impact when Non-Aggregate SQL functions are used in a SELECTed Column?

We have a report that uses a long and complex query that has the SELECT statement like below:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
NVL(NrDostawcy,'BRAK') supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
We recently made modifications to this Query and just modified the 3rd SELECTed column to add a REGEXP_LIKE
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
--NVL(NrDostawcy,'BRAK') supplier_registration,
Case When (NrDostawcy is not null and regexp_like(substr(NrDostawcy,1,2),'^[a-zA-Z]*$')) Then substr(NrDostawcy,3) else NVL(NrDostawcy,'BRAK') End supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
I checked the Explain Plans of both queries and they turned out to have the same Plan hash value.
Does this mean there's no impact on performance if I use seeded, non-aggregate SQL functions in SELECTed columns?
I believe there is an impact on performance if they're used in the WHERE clause, but I wasn't sure if the same applies to the SELECTed columns.
Apologies in advance, as I can't provide the exact query since it's proprietary and is very long and complex.
I also don't think I can create a good enough sample that would match the Explain Plan of the actual query, as it joins over 10 tables with thousands of rows of data.
Thank you!
Since you are running this query on Oracle, here's my advice. Run the query with the Oracle hint /*+ gather_plan_statistics */, once without the regex and once with it. Then find both statements in the shared pool (v$sql). The hint makes Oracle record the exact buffer gets, physical reads and also the time spent in each step of the plan. With that data you can analyze in detail how much more time the query with the regex needed to execute. I advise you to do this on data that returns more than, let's say, 10k rows. That way the difference should be visible (if you run this with 100 rows, no difference will be seen).
The execution plan is the same as it needs to query exactly the same data from the same tables. You should also see the amount of data (logical IO) unchanged.
What will not be the same however is the execution time, as the regexp_like will consume more CPU, even if you see the logical IO unchanged.
Note that if you changed which columns are selected, the execution plan could change: if all selected columns were part of an index, the optimizer might skip the table access and read the data from the index only.
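The index-only effect described above can be sketched with SQLite through Python's sqlite3 module. This is only an analogy to Oracle's behaviour: the suppliers table, its columns and the index name are invented for the example, and SQLite calls the index-only access a "covering index".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with an index on two of its columns.
conn.execute(
    "CREATE TABLE suppliers ("
    "id INTEGER PRIMARY KEY, country TEXT, reg_no TEXT, address TEXT)"
)
conn.execute("CREATE INDEX idx_country_reg ON suppliers (country, reg_no)")


def plan(sql):
    """Return the engine's plan for sql as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


# Every selected column is in the index, so the table itself is not needed.
narrow = plan("SELECT country, reg_no FROM suppliers WHERE country = 'PL'")

# address is not in the index, so the table has to be visited as well.
wide = plan("SELECT country, reg_no, address FROM suppliers WHERE country = 'PL'")

print(narrow)
print(wide)
```

Wrapping a selected column in a function such as NVL or a CASE expression does not change which columns have to be read, which fits the observation that both versions of the report query got the same plan.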
It depends upon the query and the I/O being done to get the data. Sometimes you can try creating an Oracle function-based index; you may see some improvement.
Check this link; it could help you.
https://jeffkemponoracle.com/2007/11/will-oracle-use-my-regexp-function-based-index/
thanks

Common table expressions (CTEs) with large tables

Considering a query of the following format that uses CTEs:
WITH
t1 AS (SELECT some_data1 FROM some_table),
t2 AS (SELECT some_data2 FROM t1)
SELECT some_data3 FROM t2;
Question 1:
When the query is executed does a temporary table t1 get built entirely and saved in memory, then t2 is built entirely based on the data from t1, then the SELECT can run against t2?
Question 2:
If t1 and t2 are large tables that cannot be stored in memory will they be written to disk making the query slower?
Question 3:
Should this type of query be avoided for large tables?
Answers:
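1. Yes. Up to PostgreSQL v11, CTEs are always materialized. This changes in v12, where CTEs like these (non-recursive, without side effects, referenced only once) can be folded into the outer query, so from that version on your query will probably perform better. You can EXPLAIN the query to verify that.
2. Yes.
3. Yes.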
1. No. You can actually add more CTEs and not use them in the select at the bottom, and they have no effect. The query optimizer turns them into the most efficient joins possible and executes it all together. For this reason CTEs are better and faster than temp tables.
2. This could be a problem with temp tables but is no problem for CTEs. CTEs are just expressions representing the data and do not get evaluated until the optimizer knows how you are selecting from them.
3. Nope. In fact this is the way to go instead of temp tables if your tables are large. Table size should not matter if you have proper indexes set up anyway. CTEs make it so you don't have to process records that are just going to be filtered out later in the query.

MS Access 2010 SQL Top N query by group performance issue (continued)

I have significant performance issues (up to time-outs) in MS Access 2010 with the query below. The table TempTableAnalysis contains between 10'000 and 15'000 records. I have already received the suggestion on this forum to work with a temporary table in the top 10 query (MS Access 2010 SQL Top N query by group performance issue).
Can anyone explain how to implement the temporary table in the subquery and how to join it? I can't get it to work.
Any other suggestions to improve performance are highly appreciated.
Here is my query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
Optimizing Access Query Performance For Large Data Sets
Based on your posted SQL Query, you have some options available to optimize and speed up the performance.
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
...
This is the first part, where TempTableAnalysis is the multi-thousand record subquery. If you want to squeeze a little more performance out of this "Temp" table, don't use it as a dynamic query (i.e., calculated on demand each time the query is opened). Instead, try constructing a macro that pushes the output to a static table:
Appending Subquery Data to a Static Table:
Create a QUERY object and change its type to DELETE. Design it to delete the contents of your "temporary" table object. If you prefer using SQL, the command will look like:
DELETE My_Table.*
FROM My_Table;
Create a QUERY object and change its type to APPEND. Design it to query all fields from your query defined by the SQL statement of this OP. Again, the SQL version of this task has the following syntax:
INSERT INTO StaticAnalysisTable ( ID, Loc, Item, AvgOfScaledError )
SELECT t1.ID, t1.Loc, t1.Item, t1.AvgOfScaledError
FROM TempTableAnalysis as t1;
The next step is to automate the population of this static table and it is optional. It's simple however and will make it less likely that you will make the mistake of forgetting to "Refresh" and accessing your static table while it has stale data... causing inaccuracies in your results.
Create a macro with two steps. Each step will have the following definition: OPEN QUERY. When prompted for the query to open, reference the objects you created in the previous two steps in the following order (important): (1) DELETE Query: (your delete query name) then (2) APPEND Query: (your append query name).
SQL Query Comments and Suggestions
The following part of the posted SQL query could use some help:
...
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
There is a join across the sub query that generates the TOP-10 data and the outermost query that correlates these results with the supplementing MASE table data. This isn't necessary if the TempTableAnalysis.MASE represents a key value.
My first impression was that the ORDER BY in the innermost query isn't necessary unless it is intended to force some sort of selection criteria (as when using SQL analytical functions). Ordering records from large data sets is also a wasteful CPU and memory sink.
EDIT: As a counter-point, the ORDER BY clause used beside a TOP N query does have a purpose: it decides which N rows count as the "top" ones, so without it the ten rows returned per group are arbitrary. Just to round out the discussion, another SO thread talks about How to Select Top 10 in an Access Query.
WHERE t2.MASE IN (...
You may be experiencing blocks in performance with very large in-list set operations. On an Oracle database server, I have discovered with other developers that there is a limitation to the number of discrete elements in an in-list query operator. That value was in the thousands... which may be further limited based on server and database resources.
Consider using a SQL JOIN operator. The place where you define TABLE objects can also be populated with SQL defined queries with aliases known as INLINE VIEWS. Since you're using ACCESS, if an inline view does not work directly, just define another ACCESS QUERY object and reference it in your final query as if it were a table...
A possible rewrite to the ending part of the original query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
...
FROM TempTableAnalysis AS t2,
(SELECT TOP 10 t1.MASE, t1.ABCByPick
FROM TempTableAnalysis AS t1) AS ttop
WHERE t2.MASE = ttop.MASE
AND t2.ABCByPick = ttop.ABCByPick
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
You will definitely need to run through these recommendations and validate the output data for accuracy. This represents approaches to capturing some of the "low-hanging fruit" (easy items) that you can pursue to speed up your query and reporting operations.
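When validating, one thing to check is whether a derived table with an uncorrelated TOP 10 still gives ten rows per ABCByPick group, or only ten rows overall. The sketch below compares the two shapes on a tiny made-up data set. It uses SQLite through Python's sqlite3 module, so LIMIT stands in for Access's TOP, N is 2 instead of 10, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample: two groups (A and B) with three rows each.
conn.executescript("""
    CREATE TABLE TempTableAnalysis (Loc TEXT, ABCByPick TEXT, MASE REAL);
    INSERT INTO TempTableAnalysis VALUES
        ('L1', 'A', 0.9), ('L2', 'A', 0.5), ('L3', 'A', 0.7),
        ('L4', 'B', 0.2), ('L5', 'B', 0.8), ('L6', 'B', 0.4);
""")

N = 2

# Correlated form, as in the question: top N MASE values within each group.
per_group = conn.execute(
    """
    SELECT t2.ABCByPick, t2.Loc, t2.MASE
    FROM TempTableAnalysis AS t2
    WHERE t2.MASE IN (
        SELECT t1.MASE
        FROM TempTableAnalysis AS t1
        WHERE t1.ABCByPick = t2.ABCByPick
        ORDER BY t1.MASE DESC
        LIMIT ?
    )
    ORDER BY t2.ABCByPick, t2.MASE DESC
    """,
    (N,),
).fetchall()

# Uncorrelated derived table: top N rows of the whole table, then joined back.
overall = conn.execute(
    """
    SELECT t2.ABCByPick, t2.Loc, t2.MASE
    FROM TempTableAnalysis AS t2
    JOIN (
        SELECT MASE, ABCByPick
        FROM TempTableAnalysis
        ORDER BY MASE DESC
        LIMIT ?
    ) AS ttop
      ON t2.MASE = ttop.MASE AND t2.ABCByPick = ttop.ABCByPick
    ORDER BY t2.ABCByPick, t2.MASE DESC
    """,
    (N,),
).fetchall()

print(per_group)  # N rows for each group
print(overall)    # N rows in total
```

On this sample the two forms return different row counts, so a join against an uncorrelated top-N query is not a drop-in replacement for the per-group version.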
Conclusions and Closing Comments
As background for other readers, the database object TempTableAnalysis is not a static table. It is the result of a subquery presented in another SO post requesting help with an Access TOP N Query. The query comes from multiple tables approaching 10,000 records in size (each?).
Tip: A query result in Access ALSO has potential table-like behaviors. You can push the output to a table for joining (as described above) or just join to the query object itself (careful though, especially when you get to "chaining" multiple query operations...)
The strategy of this solution was:
To minimize the number of trips through one or more instances of this very large table.
To pre-process and index optimize any data that would otherwise be "static" for the duration of its analysis.
To audit and review the SQL code used to obtain the final results.
Definitely look into Access MACROS. Coupled with identifying static data in your data sets, you can offload processing of your complex background analytic queries to improve the user experience when they view and query through the final results. Good Luck!

Why select Top clause could lead to long time cost

The following query takes forever to finish. But if I remove the top 10 clause, it finishes rather quickly. big_table_1 and big_table_2 are two tables with 10^5 records each.
I used to believe that the top clause would reduce the time cost, but that's apparently not the case here. Why?
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
There are other stackoverflow discussions on this same topic (links at bottom). As noted in the comments above it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top serviceid from (select * ...), and the optimizer may have difficulty pushing the query down to the inner queries and making use of the index.
Consider rewriting it as
select top 10 ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
on cap1.servicerequestid = cap2.customerreferencenumber
and cap1.statusid = 2
In your query, the database is probably trying to merge the results and return them and THEN limit it to the top 10 in the outer query. In the above query the database will only have to gather the first 10 results as results are being merged, saving loads of time. And if servicerequestID is indexed, it will be sure to use it. In your example, the query is looking for the servicerequestid column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever format we put SQL in and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?
I had a similar problem with a query like yours. The ordered query without the top clause took 1 second; the same query with top 3 took 1 minute.
I found that when I used a variable for the top value, it worked as expected.
The code for your case:
declare @top int = 10;
select top (@top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
I can't explain why, but I can give an idea:
Try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this setting stays in effect for the session, so you have to switch it off again with SET ROWCOUNT 0 after running your query.
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
This can also depend on what you mean by "finished". If "finished" means you start seeing some display on a GUI, that does not necessarily mean the query has completed executing. It can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do its processing until all the results of the inner query are available:
the outer query has to wait for the last row of the inner query to be returned before it can "finish"
running the inner query independently may only require waiting until the first row is returned before you see any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to have a grouping function (or an ordering) to force processing all rows before a result can be returned. I would use this as a measure of how long the inner query really takes for comparison to the time in the outer query takes.
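The difference between "first row back" and "last row back" can be mimicked outside the database. The sketch below is a toy model in Python, not how any particular engine is implemented: a generator plays the inner query, itertools.islice plays a consumer that stops after ten rows, and sorted() plays an ordering step that cannot return anything until it has seen every row. The function name and the counters are made up for the example.

```python
from itertools import islice


def inner_rows(total, counter):
    """Toy stand-in for an inner query that yields one row at a time."""
    for i in range(total):
        counter["produced"] += 1
        yield i


# Streaming consumer: stops pulling rows as soon as it has ten of them.
streaming = {"produced": 0}
first_ten = list(islice(inner_rows(1000, streaming), 10))

# Blocking consumer: a sort has to see every row before it can return any.
blocking = {"produced": 0}
top_ten_sorted = sorted(inner_rows(1000, blocking), reverse=True)[:10]

print(streaming["produced"], blocking["produced"])
```

In this model the streaming consumer forces the inner generator to produce only as many rows as it asked for, while the sort drains it completely, which is why adding an ordering or grouping step is a reasonable way to measure how long the inner query takes in full.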
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Use a JDBC client to call executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp. See if the results stagger. I could not test this with PostgreSQL (RETURN NEXT), since PostgreSQL does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);
When debugging things like this I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit CTRL-L in SSMS in the query view and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data in your tables, e.g. histograms of the number of rows with data in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests for some inputs a particular subquery might be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively search the table) instead of an index seek (jump right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
If you want to compare the performance of your two queries, you have to run them under the same conditions (with clean memory buffers) and collect numeric statistics.
Run this batch for each query to compare execution time and statistics results
(do not run it on a production environment):
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows and t2 has 20M rows. The average number of rows from the joined tables for a t1.Code is about 35K. The actual resultset is only 3 rows because t1.Code = 'MyCode' only matches 2 rows, which only have 3 corresponding rows in t2. Stats are up-to-date.
With the TOP 5 as above the query takes minutes, with the TOP 5 removed the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
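A toy model may help show why a "scan the big table and stop after N matches" plan is so sensitive to how many matching rows exist and where they sit. This is plain Python, not SQL Server's actual costing or execution logic, and the table size, the match positions and the function name are all assumptions chosen for illustration.

```python
def rows_scanned(total_rows, matching_ids, top_n):
    """Scan ids from highest to lowest, stopping once top_n matches are found."""
    found = 0
    scanned = 0
    for row_id in range(total_rows, 0, -1):
        scanned += 1
        if row_id in matching_ids:
            found += 1
            if found == top_n:
                break
    return scanned


total = 200_000

# What an "evenly distributed" estimate assumes: many matches, spread out.
evenly_spread = set(range(total, 0, -1000))

# Closer to the case above: only three matching rows exist at all.
only_three = {total - 1, total - 2, total - 3}

cheap = rows_scanned(total, evenly_spread, 5)
expensive = rows_scanned(total, only_three, 5)

print(cheap, expensive)
```

With plenty of evenly spread matches the scan stops after a few thousand rows, but when fewer than N matches exist it can never stop early and ends up reading the whole table, whatever the TOP value promised.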
TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using @sql_mommy's query, but make sure you have the following:
To get your query to run faster, you could create an index on servicerequestid and statusid in big_table_1 and an index on customerreferencenumber in big_table_2. If you create nonclustered indexes, you should get an index-only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you use on big_table_1, but I'm not sure.
Gísli
It might be a good idea to compare the execution plans between the two. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases, you would expect better performance with the top 10. In your case, performance is worse. If this is the case, you will not only see a difference between the execution plans, but you will also see a difference in the number of returned rows between the estimated execution plan and the actual execution plan, leading to the poor decision by the SQL engine.
Try again after recomputing your statistics (and while you're at it, rebuilding indices).
Also check if it helps to take out the where big_table_1.StatusId=2 and instead go for
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, and it should optimise to the same execution plan (though it is remotely possible that it doesn't). The returned end result will be identical regardless.

Inner joins involving three tables

I have a SELECT statement that has three inner joins involving two tables.
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, are there other things I can do to optimize the joins, such as rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune the PostgreSQL config, run VACUUM ANALYZE and apply all the general optimizations.
If this is not enough and you can spend a few days, you may write code to create a materialized view as described in the PostgreSQL wiki.
You likely have an error in your example, because you're selecting the same record from my_table twice, you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code, t1 will always be the same row as t2 (assuming id is unique in my_table).
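A quick way to convince yourself of this is to run both shapes on a small sample and compare the results. The sketch below does that with SQLite through Python's sqlite3 module. It assumes id is the primary key of both tables, and the name and info columns and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: id is unique in both tables.
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE other_table (id INTEGER PRIMARY KEY, info TEXT);
    INSERT INTO my_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO other_table VALUES (1, 'x'), (3, 'z');
""")

# Original shape: my_table joined to itself on id, then to other_table.
with_self_join = conn.execute("""
    SELECT t1.id, t1.name, t3.info
    FROM my_table AS t1
    INNER JOIN my_table AS t2 ON t2.id = t1.id
    INNER JOIN other_table AS t3 ON t2.id = t3.id
    ORDER BY t1.id
""").fetchall()

# Simplified shape: the self-join is dropped.
without_self_join = conn.execute("""
    SELECT t1.id, t1.name, t3.info
    FROM my_table AS t1
    INNER JOIN other_table AS t3 ON t1.id = t3.id
    ORDER BY t1.id
""").fetchall()

print(with_self_join)
print(without_self_join)
```

If id were not unique in my_table, the self-join would multiply rows and the two queries would no longer be equivalent.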
But let's assume you mean ON t2.idX = t1.id. Then, to answer your question, you can't get much better performance than what you have. You could index the columns, or you could go further and define them as foreign key relationships (which wouldn't add much in terms of performance compared to the gain from indexing them in the first place).
You might instead like to look at restricting your where clause; perhaps that is where your indexing would be as beneficial, if not more so.
You could write your query using WHERE EXISTS (if you don't need to select data from all three tables) rather than INNER JOINs, but the performance will be almost identical (except when this is itself inside a nested query), as it still needs to find the records.
In PostgreSQL, most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that sometimes queries can't be optimized themselves, or that they might not need to be, but this doesn't have any of the problem areas that I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The first thing to do is to run vacuum analyze to make sure you have optimal statistics. Then use explain analyze to compare expected query performance to actual. From that point, we'd look at indexes etc. There isn't anything in this query that needs to be optimized on a query level. However, without looking at your actual filters in your where clause and the actual output of explain analyze, there isn't much that can be suggested.
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.