How should I refactor subqueries - sql

I am aware of three ways of writing subqueries into a query.
Normal subquery
With clause
Temporary table
Subqueries get extremely messy when there are multiples of them in a query, and especially when you have nested subqueries.
With clause is my preference, but you can only use the subqueries in a WITH clause in the select statement which directly follows the WITH clause (I believe).
Temporary tables are good, but they require quite a bit of over head in declaring the table.
Are there any other ways to refactor subqueries other than these?
And are there any trade offs between them that I haven't considered?

You are leaving out some other capabilities.
The most obvious is views. If you have a complex query that is going to be used multiple times -- particularly one that might be implementing business rules between tables -- then a view is very useful. If performance is an issue, then you can materialize the view.
A common use of subqueries is to generate additional columns in a table -- such as the difference between two columns. You can use computed columns for these calculations, and make them part of the data definition.
Finally, you could implement user-defined functions. user-defined table-valued functions are a lot like views with parameters. This can be really helpful under some circumstances. And the underlying queries should generally be quite optimized.
Another type of user-defined functions are scalar functions. These usually incur more overhead, but can be quite useful at times.
All that said, if you structure your queries cleanly, then subqueries and CTEs won't look "messy". You might actually find that you can write very long queries that make sense both to you and to other people reading them.

As a matter of preference and readability more than performance, with is probably the best.
I don't know which database you are using, but in Oracle the with create a temporary view/table accessible with the name on the LHS of the as and is not really distinct from a subquery: this name may be used like it were a normal table.
The select * from (select * from a) is doing the same: the only matter is that you can not reuse that result:
select * from (subquery1) q left join t1 on t1.id = q.id
union all
select * from (subquery1) q left join t2 on t2.id = q.id;
But that is where the query plan is important: subquery1 is the same in both case and the plan may be one that use a temporary table/view, thus reducing the cost of whole.
The with is ultimately a way to create temporary table/view and also force the plan optimizer to build query in some order which may (not) be best.
Temporary table would be good if you know the result would be reused later, not in the same query (in which case the with does the same work, given the temporary table it use) and even transaction (example: saving the result of a search):
begin
insert into tmp (...);
select * from tmp q left join t1 on t1.id = q.id;
select * from tmp q left join t2 on t2.id = q.id;
end;
The tmp table is used twice in the same transaction but not in the same query: your database won't recompute the result twice and it is probably fine if all you are doing are select (no mutation on tmp source).

Related

Does using subquery to select specific columns in from clause efficient?

Suppose table1 has column a,b,c,.....z.
Does selecting specific columns using sub-query in from clause perform better than
just selecting all(*)?
OR does it result in more calculations?
( I am using HIVE)
A)
select
table1.a,
table1.b,
table2.aa,
table2.bb
FROM (SELECT table1.a
,table1.b
FROM table1
)
join table2 on (table1.b = table2.b)
B)
select
table1.a,
table1.b,
table2.aa,
table2.bb
FROM table1
join table2 on (table1.b = table2.b)
Thanks
Using a subquery to select specific columns is generally a bad idea, regardless of the database. However, in most databases, it make no difference to performance.
Why is it a bad idea? Basically, it can confuse the optimizer and/or anyone reading the query. Depending on the database,
It makes queries a bit harder to maintain (adding a new column can require repeating the column name over and over in subqueries).
The subqueries might be materialized (this should not occur in Hive).
Pruning partitions may not work if the pruning is in outer queries.
It may preclude the use of indexes (does not apply to Hive).
It may confuse the optimizer and statistics (probably does not apply to Hive).
I cannot think of a good reason for actually complicating a query by introducing unnecessary subqueries for this purpose.

Clarification about Select from (select...) statement

I came across a SQL practice question. The revealed answer is
SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM (
SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
FROM station);
Normally, I would enocunter
select[] from[] where [] (select...)
which to imply that the selected variable from the inner loop at the where clause will determine what is to be queried in the outer loop. As mentioned at the beginning, this time the select is after
FROM to me I'm curious the functionality of this. Is it creating an imaginary table?
The piece in parentheses:
(SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d FROM station)
is a subquery.
What's important here is that the result of a subquery looks like a regular table to the outer query. In some SQL flavors, an alias is necessary immediately following the closing parenthesis (i.e. a name by which to refer to the table-like result).
Whether this is technically a "temporary table" is a bit of a detail as its result isn't stored outside the scope of the query; and there is an also a thing called a temporary table which is stored.
Additionally (and this might be the source of confusion), subqueries can also be used in the WHERE clause with an operator (e.g. IN) like this:
SELECT student_name
FROM students
WHERE student_school IN (SEELCT school_name FROM schools WHERE location='Springfield')
This is, as discussed in the comments and the other answer a subquery.
Logically, such a subquery (when it appears in the FROM clause) is executed "first", and then the results treated as a table1. Importantly though, that is not required by the SQL language2. The entire query (including any subqueries) is optimized as a whole.
This can include the optimizer doing things like pushing a predicate from the outer WHERE clause (which, admittedly, your query doesn't have one) down into the subquery, if it's better to evaluate that predicate earlier rather than later.
Similarly, if you had two subqueries in your query that both access the same base table, that does not necessarily mean that the database system will actually query that base table exactly twice.
In any case, whether the database system chooses to materialize the results (store them somewhere) is also decided during the optimization phase. So without knowing your exact RDBMS and the decisions that the optimizer takes to optimize this particular query, it's impossible to say whether it will result in something actually being stored.
1Note that there is no standard terminology for this "result set as a table" produced by a subquery. Some people have mentioned "temporary tables" but since that is a term with a specific meaning in SQL, I shall not be using it here. I generally use the term "result set" to describe any set of data consisting of both columns and rows. This can be used both as a description of the result of the overall query and to describe smaller sections within a query.
2Provided that the final results are the same "as if" the query had been executed in its logical processing order, implementations are free to perform processing in any ordering they choose to.
As there are so many terms involved, I just thought I'll throw in another answer ...
In a relational database we deal with tables. A query reads from tables and its result again is a table (albeit not a stored one).
So in the FROM clause we can access query results just like any stored table:
select * from (select * from t) x;
This makes the inner query a subquery to our main query. We could also call this an ad-hoc view, because view is the word we use for queries we access data from. We can move it to the begin of our main query in order to enhance readability and possibly use it multiple times in it:
with x as (select * from t) select * from x;
We can even store such queries for later access:
create view v as select * from t;
select * from v;
In the SQL standard these terms are used:
BASE TABLE is a stored table we create with CREATE TABLE .... t in above examples is supposed to be a base table.
VIEWED TABLE is a view we create with CREATE VIEW .... v above examples is a viewed table.
DERIVED TABLE is an ad-hoc view, such as x in the examples above.
When using subqueries in other clauses than FROM (e.g. in the SELECT clause or the WHERE clause), we don't use the term "derived table". This is because in these clauses we don't access tables (i.e. something like WHERE mytable = ... does not exist), but columns and expression results. So the term "subquery" is more general than the term "derived table". In those clauses we still use various terms for subqueries, though. There are correlated and non-correlated subqueries and scalar and non-scalar ones.
And to make things even more complicated we can use correlated subqueries in the FROM clause in modern DBMS that feature lateral joins (sometimes implemented as CROSS APPLY and OUTER APPLY). The standard calls these LATERAL DERIVED TABLES.

Querying Table vs Querying Subquery of Same Table

I'm working with some very old legacy code, and I've seen a few queries that are structured like this
SELECT
FieldA,
FieldB,
FieldC
FROM
(
SELECT * FROM TABLE1
)
LEFT JOIN TABLE2 ON...
Is there any advantage to writing a query this way?
This is in Oracle.
There would seem to be no advantage to using a subquery like this. The reason may be a historical relic, regarding the code.
Perhaps once upon a time, there was a more complicated query there. The query was replaced by a table/view, and the author simply left the original structure.
Similarly, once upon a time, perhaps a column needed to be calculated (say for the outer query or select). This column was then included in the table/view, but the structure remained.
I'm pretty sure that Oracle is smart enough to ignore the subquery when optimizing the query. Not all databases are that smart, but you might want to clean-up the code. At the very least, such as subquery looks awkward.
As a basic good practice in SQL, you should not code a full-scan from a table (SELECT * FROM table, without a WHERE clause), unless necessary, for performance issues.
In this case, it's not necessary: The same result can be obtained by:
SELECT
Fields
FROM
TABLE1 LEFT JOIN TABLE2 ON...

Oracle performance questions, inner selects in joins, temporary WITH tables indexes

I would like to consult three aspects of performance (Oracle 11g).
1./ If I define temporary table by keyword "WITH" like
WITH tbl AS (
SELECT [columns from both tables...]
FROM table_with_inexes
JOIN other_table ...
)
SELECT ...
FROM tbl
JOIN xxx ON tbl.column = xxx.column
is subsequent select on that temporary table able to use indexes, that was defined on table_with_inexes and other_table?
2./ Is it possible to add indexes to temporary table created by "WITH" in that above-like single SQL command?
3./ When I have construct such as this:
...
LEFT JOIN (
SELECT indexedColumn, otherColumns
FROM table
JOIN other_table
GROUP BY ...
) C
ON (outerTable.indexedColumn = C.indexedColumn)
in which cases could Oracle use indexes on indexedColumn? I assume, that the select in LEFT JOIN is only "projection" that does not maintain indexes, so the join's ON clausule evaluation is evaluated without using indexes?
The WITH clause (or subquery factoring as it's known as) is just a means of creating aliases for subqueries. It's most useful when you have multiple copies of the same subquery in your query, in which case Oracle may or may not choose to create a temporary table for it behind the scenes (aka "materialize" it). You should read up on this - here's a good link.
To answer your questions:
1) If the indexes are available to be used (no functions on the columns involved, selecting a small percentage of the data etc, etc) then they'll be used, just like in any other query.
2) You can't add indexes to the subquery. Not even to the temporary table that Oracle might create behind the scenes; you have no control over that.
3) I suggest you read up about when indexes might or might not be used. Try http://www.orafaq.com/node/1403 or http://www.orafaq.com/tuningguide/not%20using%20index.html, or perform your own google search.
WITH clause might be either inlined or materialized. It's up to Oracle to decide which approach is better. In your case most probably both queries will have the same execution plan(will be inlined)
PS: even if the table is materialized, indexes can not be added, Oracle can not do that. On the other hand in most cases it is not even necessary, the table can be materialized as a hash table(not heap table) or full table scan is used on it.

Inner joins involving three tables

I have a SELECT statement that has three inner joins involving two tables.
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, is there other things I can do to optimize the joins, as in rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune PostgreSQL config, VACUUM ANALIZE and all general optimizations.
If this is not enough and you can spend few days you may write code to create materialized view as described in postgresql wiki.
You likely have an error in your example, because you're selecting the same record from my_table twice, you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code t1 Will always be t2.
But lets assume you mean ON t2.idX = t1.id; then to answer your question, you can't get much better performance than what you have, you could index them or you could go further and define them as foreign key relationships (which wouldn't do too much in terms of performance benefits compared to non-index vs indexing them).
You might instead like to look at restricting your where clause and perhaps that is where your indexing would be as (if not more) beneficial.
You could write your query as using WHERE EXISTS (if you don't need to select data from all three tables) rather than INNER JOINS but the performance will be almost identical (except when this is itself inside a nested query) as it still needs find the records.
In PostgreSQL. most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that sometimes queries can't be optimized themselves, or that they might not need to be, but this doesn't have any of the problem areas that I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The fist thing to do is to run vacuum analyze to make sure you have optimal statistics. Then use explain analyze to compare expected query performance to actual. From that point, we'd look at indexes etc. There isn't anything in this query that needs to be optimized on a query level. However without looking at your actual filters in your where clause and the actual output of explain analyze there isn't much that can be suggested.
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.