Common table expressions (CTEs) with large tables - sql

Considering a query of the following format that uses CTEs:
WITH
t1 AS (SELECT some_data1 FROM some_table),
t2 AS (SELECT some_data2 FROM t1)
SELECT some_data3 FROM t2;
Question 1:
When the query is executed does a temporary table t1 get built entirely and saved in memory, then t2 is built entirely based on the data from t1, then the SELECT can run against t2?
Question 2:
If t1 and t2 are large tables that cannot be stored in memory will they be written to disk making the query slower?
Question 3:
Should this type of query be avoided for large tables?

Answers:
Yes. Up to PostgreSQL v11, CTEs are always materialized. This changes in v12, and from that version on your query will probably perform better.
You can EXPLAIN the query to verify that.
Yes.
Yes.
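If you are on PostgreSQL 12 or later, you can both check and control the materialization explicitly. A minimal sketch based on the query shape from the question (using some_data1 throughout so it is self-consistent):
-- PostgreSQL 12+: a side-effect-free CTE referenced only once is inlined by default,
-- and you can force either behaviour with [NOT] MATERIALIZED.
WITH
t1 AS NOT MATERIALIZED (SELECT some_data1 FROM some_table),
t2 AS NOT MATERIALIZED (SELECT some_data1 FROM t1)
SELECT some_data1 FROM t2;
-- EXPLAIN (ANALYZE) shows a "CTE Scan" node when a CTE is materialized;
-- an inlined CTE is folded into an ordinary scan of some_table.
EXPLAIN (ANALYZE)
WITH
t1 AS (SELECT some_data1 FROM some_table),
t2 AS (SELECT some_data1 FROM t1)
SELECT some_data1 FROM t2;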

No. You can actually add more CTEs and not use them in the SELECT at the bottom, and they have no effect. The query optimizer turns them into the most efficient joins possible and executes it all together. For this reason CTEs are better and faster than temp tables.
This could be a problem with temp tables, but it is not a problem for CTEs. CTEs are just expressions representing the data and do not get evaluated until the optimizer knows how you are selecting from them.
Nope. In fact this is the way to go instead of temp tables if your tables are large. Table size should not matter if you have proper indexes set up anyway. CTEs make it so you don't have to process records that are just going to be filtered out later in the query anyway.
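For instance, a CTE that is never referenced in the final SELECT adds nothing to the work done (a sketch; unused_extra is a made-up name, and most optimizers simply drop it from the plan):
WITH
t1 AS (SELECT some_data1 FROM some_table),
unused_extra AS (SELECT some_data1 FROM some_table)
SELECT some_data1 FROM t1;
-- unused_extra never appears in the final SELECT, so it is not executed.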

Related

Is using a subquery to select specific columns in the FROM clause efficient?

Suppose table1 has columns a, b, c, ..., z.
Does selecting specific columns using a sub-query in the FROM clause perform better than
just selecting all (*)?
Or does it result in more calculations?
(I am using Hive.)
A)
select
t1.a,
t1.b,
table2.aa,
table2.bb
FROM (SELECT table1.a
,table1.b
FROM table1
) t1
join table2 on (t1.b = table2.b)
B)
select
table1.a,
table1.b,
table2.aa,
table2.bb
FROM table1
join table2 on (table1.b = table2.b)
Thanks
Using a subquery to select specific columns is generally a bad idea, regardless of the database. However, in most databases, it makes no difference to performance.
Why is it a bad idea? Basically, it can confuse the optimizer and/or anyone reading the query. Depending on the database:
It makes queries a bit harder to maintain (adding a new column can require repeating the column name over and over in subqueries).
The subqueries might be materialized (this should not occur in Hive).
Pruning partitions may not work if the pruning is in outer queries.
It may preclude the use of indexes (does not apply to Hive).
It may confuse the optimizer and statistics (probably does not apply to Hive).
I cannot think of a good reason for actually complicating a query by introducing unnecessary subqueries for this purpose.
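To see the maintenance point concretely, compare what it takes to add column c from the question's table1 to each version (a sketch based on queries A and B above):
-- With the wrapping subquery, c has to be added in two places:
SELECT t1.a, t1.b, t1.c, table2.aa, table2.bb
FROM (SELECT table1.a, table1.b, table1.c FROM table1) t1
JOIN table2 ON (t1.b = table2.b);
-- Without it, only the outer select list changes:
SELECT table1.a, table1.b, table1.c, table2.aa, table2.bb
FROM table1
JOIN table2 ON (table1.b = table2.b);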

How should I refactor subqueries

I am aware of three ways of writing subqueries into a query.
Normal subquery
With clause
Temporary table
Subqueries get extremely messy when there are several of them in a query, and especially when they are nested.
The WITH clause is my preference, but you can only use the subqueries in a WITH clause in the SELECT statement that directly follows the WITH clause (I believe).
Temporary tables are good, but they require quite a bit of overhead in declaring the table.
Are there any other ways to refactor subqueries other than these?
And are there any trade-offs between them that I haven't considered?
You are leaving out some other capabilities.
The most obvious is views. If you have a complex query that is going to be used multiple times -- particularly one that might be implementing business rules between tables -- then a view is very useful. If performance is an issue, then you can materialize the view.
A common use of subqueries is to generate additional columns in a table -- such as the difference between two columns. You can use computed columns for these calculations, and make them part of the data definition.
Finally, you could implement user-defined functions. User-defined table-valued functions are a lot like views with parameters. This can be really helpful under some circumstances, and the underlying queries should generally be quite optimized.
Another type of user-defined functions are scalar functions. These usually incur more overhead, but can be quite useful at times.
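Syntax varies by database; a rough SQL Server-flavored sketch of the view, computed column, and inline table-valued function ideas, using hypothetical customers and orders tables:
-- A view packaging up a reusable business rule:
CREATE VIEW active_customer_orders AS
SELECT c.customer_id, c.name, o.order_id, o.total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE c.is_active = 1;
GO
-- A computed column, so a derived value lives in the table definition:
ALTER TABLE orders ADD margin AS (total - cost);
GO
-- An inline table-valued function: essentially a view with parameters:
CREATE FUNCTION orders_since (@cutoff date)
RETURNS TABLE
AS RETURN
(
SELECT order_id, customer_id, total
FROM orders
WHERE order_date >= @cutoff
);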
All that said, if you structure your queries cleanly, then subqueries and CTEs won't look "messy". You might actually find that you can write very long queries that make sense both to you and to other people reading them.
As a matter of preference and readability more than performance, WITH is probably the best.
I don't know which database you are using, but in Oracle the WITH clause creates a temporary view/table accessible by the name on the LHS of the AS, and it is not really distinct from a subquery: this name may be used as if it were a normal table.
A select * from (select * from a) does the same thing; the only difference is that you cannot reuse that result:
select * from (subquery1) q left join t1 on t1.id = q.id
union all
select * from (subquery1) q left join t2 on t2.id = q.id;
But that is where the query plan is important: subquery1 is the same in both cases, and the plan may be one that uses a temporary table/view, thus reducing the overall cost.
The WITH clause is ultimately a way to create a temporary table/view, and it also forces the plan optimizer to build the query in a certain order, which may (or may not) be best.
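Rewritten with a WITH clause, subquery1 is named once and reused, so the optimizer is free to compute it a single time (a sketch, with subquery1 standing in for the shared subquery as before):
with q as (subquery1)
select * from q left join t1 on t1.id = q.id
union all
select * from q left join t2 on t2.id = q.id;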
A temporary table would be good if you know the result will be reused later, not in the same query (in which case the WITH clause does the same work, given the temporary table/view it uses), or even later in the same transaction (example: saving the result of a search):
begin
insert into tmp (...);
select * from tmp q left join t1 on t1.id = q.id;
select * from tmp q left join t2 on t2.id = q.id;
end;
The tmp table is used twice in the same transaction but not in the same query: your database won't recompute the result twice, and this is probably fine if all you are doing are selects (no mutation of the tmp source).

Performance of "NOT IN" in SQL query

I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of the queries and I'm wondering whether my thought process is correct here and why Query Optimizer works the way it works in this case.
I'm on SQL Server 2012.
I've got a SQL query that looks like
SELECT * FROM T1
WHERE Id NOT IN
(SELECT DISTINCT T1_Id from T2);
It takes around 30 seconds to run on my test server.
While trying to understand what is taking so long I rewrote it using a temp table, like this:
SELECT DISTINCT T1_Id
INTO #temp from T2;
SELECT * FROM T1
WHERE Id NOT IN
(SELECT T1_Id from #temp);
It runs a hundred times faster than the first one.
Some info about the tables:
T2 has around 1 million rows, and there are around 1000 distinct values of T1_id there. T1 has around 1000+ rows. Initially I only had a clustered index on T2 on a column other than T1_Id, so T1_id wasn't indexed at all.
Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case.
That made me realize that adding a non-clustered index on T1_id may be a good idea (the index should've been there from the start, admittedly), and adding an index indeed made the original query run much faster since now it does nonclustered index seeks.
What I'm looking for is to understand the Query optimizer behavior for the original query - does it look reasonable? Are there any ways to make it work in a way similar to the temporary table variant that I posted here rather than doing multiple scans? Am I just misunderstanding something here?
Thanks in advance for any links to the similar discussion as I haven't really found anything useful.
NOT IN is intuitive but slow. This construct will generally run quicker:
where id in
(select id from t1
except select t1_id from t2)
The actual performance will likely vary from the estimates, but neither of your queries will out-perform this query, which is the de facto standard approach:
SELECT T1.* FROM T1
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL
This uses a proper join, which will perform very well (assuming the foreign key column is indexed), and because it is a left (outer) join, the WHERE condition selects only those rows from T1 that don't join (all columns of the right-side table are null when the join misses).
Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.
The SQL Server optimizer needs to understand the size of tables for some of its decisions.
When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the table is actually materialized, the count would be highly accurate.
I think the first would be faster with an index on
Table2(t1_id)
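In SQL Server that would be something along these lines (a sketch against the question's T2 table; the index name is made up):
CREATE NONCLUSTERED INDEX IX_T2_T1_Id ON T2 (T1_Id);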
This is just a guess, but hopefully an educated one...
The DBMS probably concluded that searching a large table a small number of times is faster than searching a small table a large number of times. That's why you had ~1000 searches on T2, instead of ~1000000 searches on T1.
When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.
I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?
One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than T1.Id). If MS SQL Server does something similar, and the FK is missing (or is untrusted), that could contribute to SQL Server thinking there are more distinct values than there really are.
(BTW, it would have helped if you posted the actual query plans and the database structure.)
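For what it's worth, a declared and validated ("trusted") foreign key in SQL Server would look roughly like this (a sketch; the constraint name is made up):
-- WITH CHECK validates the existing rows, so the constraint starts out trusted:
ALTER TABLE T2 WITH CHECK
ADD CONSTRAINT FK_T2_T1 FOREIGN KEY (T1_Id) REFERENCES T1 (Id);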

Inner joins involving three tables

I have a SELECT statement that has three inner joins involving two tables.
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, are there other things I can do to optimize the joins, such as rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune the PostgreSQL config, run VACUUM ANALYZE, and apply all the general optimizations.
If this is not enough and you can spend a few days, you may write code to create a materialized view as described in the PostgreSQL wiki.
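On PostgreSQL 9.3 and later you can also use the built-in syntax instead of hand-rolling it; a minimal sketch with made-up view and column names:
CREATE MATERIALIZED VIEW my_table_joined AS
SELECT t1.id, t3.some_column
FROM my_table AS t1
INNER JOIN other_table AS t3 ON t1.id = t3.id;
-- Recompute the stored result when the underlying tables change:
REFRESH MATERIALIZED VIEW my_table_joined;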
You likely have an error in your example: because you're selecting the same record from my_table twice, you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code t1 will always be t2.
But let's assume you mean ON t2.idX = t1.id; then, to answer your question, you can't get much better performance than what you have. You could index the columns, or you could go further and define them as foreign key relationships (which wouldn't do too much in terms of performance benefit compared to just indexing them).
You might instead like to look at restricting your WHERE clause, and perhaps that is where your indexing would be as (if not more) beneficial.
You could write your query using WHERE EXISTS (if you don't need to select data from all three tables) rather than INNER JOINs, but the performance will be almost identical (except when this is itself inside a nested query), as it still needs to find the records.
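For example, keeping the question's ... placeholders, the simplified two-table form above could be written as (a sketch):
SELECT
...
FROM
my_table AS t1
WHERE
EXISTS (SELECT 1 FROM other_table AS t3 WHERE t3.id = t1.id)
AND ...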
In PostgreSQL, most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that sometimes queries can't be optimized themselves, or that they might not need to be, but this one doesn't have any of the problem areas that I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The first thing to do is to run VACUUM ANALYZE to make sure you have optimal statistics. Then use EXPLAIN ANALYZE to compare expected query performance to actual. From that point, we'd look at indexes, etc. There isn't anything in this query that needs to be optimized at the query level. However, without looking at the actual filters in your WHERE clause and the actual output of EXPLAIN ANALYZE, there isn't much that can be suggested.
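Concretely, that workflow looks something like this (a sketch, keeping the question's ... placeholders):
-- Refresh planner statistics (and clean up dead rows) first:
VACUUM ANALYZE my_table;
VACUUM ANALYZE other_table;
-- Then compare the planner's estimates with what actually happened:
EXPLAIN ANALYZE
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...;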
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.

In which sequence are queries and sub-queries executed by the SQL engine?

Hello, I took a SQL test and am dubious/curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answers were:
primary query -> sub query -> sub sub query and so on
sub sub query -> sub query -> prime query
the whole query is interpreted at one time
There is no fixed sequence of interpretation, the query parser takes a decision on fly
I chose the last answer (just supposing that it is the most reliable compared to the others).
Now the curiosity:
Where can I read about this, and briefly, what is the mechanism behind all of that?
Thank you.
I think answer 4 is correct. There are a few considerations:
type of subquery - is it correlated or not. Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
Here the subquery is correlated - there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper that doing it for each row.
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly".
Even with identical servers, schema, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but ... I think the second answer is more plausible.
The primary query usually can't be calculated without the subquery results.