Querying Table vs Querying Subquery of Same Table - sql

I'm working with some very old legacy code, and I've seen a few queries that are structured like this
SELECT
FieldA,
FieldB,
FieldC
FROM
(
SELECT * FROM TABLE1
)
LEFT JOIN TABLE2 ON...
Is there any advantage to writing a query this way?
This is in Oracle.

There would seem to be no advantage to using a subquery like this. The structure is most likely a historical relic of how the code evolved.
Perhaps once upon a time, there was a more complicated query there. The query was replaced by a table/view, and the author simply left the original structure.
Similarly, once upon a time, perhaps a column needed to be calculated (say for the outer query or select). This column was then included in the table/view, but the structure remained.
I'm pretty sure that Oracle is smart enough to ignore the subquery when optimizing the query. Not all databases are that smart, but you might want to clean up the code. At the very least, such a subquery looks awkward.
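One way to confirm this on your own system is to compare Oracle's plans for the two forms; a sketch, keeping the join condition elided as in the question:

```sql
-- Plan for the original form with the subquery...
EXPLAIN PLAN FOR
SELECT FieldA, FieldB, FieldC
FROM (SELECT * FROM TABLE1)
LEFT JOIN TABLE2 ON ...;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- ...versus the flattened form; the plans should be identical.
EXPLAIN PLAN FOR
SELECT FieldA, FieldB, FieldC
FROM TABLE1
LEFT JOIN TABLE2 ON ...;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the two plans match, the subquery is purely cosmetic and safe to remove.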

As a basic good practice in SQL, you should avoid coding a full table scan (SELECT * FROM table without a WHERE clause) unless it is necessary, because of the performance cost.
In this case, it's not necessary: the same result can be obtained by:
SELECT
Fields
FROM
TABLE1 LEFT JOIN TABLE2 ON...

Related

Is using a subquery to select specific columns in the FROM clause efficient?

Suppose table1 has columns a, b, c, ..., z.
Does selecting specific columns using a sub-query in the FROM clause perform better than
just selecting all (*)?
Or does it result in more calculations?
(I am using Hive)
A)
select
t1.a,
t1.b,
table2.aa,
table2.bb
FROM (SELECT table1.a
,table1.b
FROM table1
) t1
join table2 on (t1.b = table2.b)
B)
select
table1.a,
table1.b,
table2.aa,
table2.bb
FROM table1
join table2 on (table1.b = table2.b)
Thanks
Using a subquery to select specific columns is generally a bad idea, regardless of the database. However, in most databases, it makes no difference to performance.
Why is it a bad idea? Basically, it can confuse the optimizer and/or anyone reading the query. Depending on the database:
It makes queries a bit harder to maintain (adding a new column can require repeating the column name over and over in subqueries).
The subqueries might be materialized (this should not occur in Hive).
Pruning partitions may not work if the pruning is in outer queries.
It may preclude the use of indexes (does not apply to Hive).
It may confuse the optimizer and statistics (probably does not apply to Hive).
I cannot think of a good reason for actually complicating a query by introducing unnecessary subqueries for this purpose.
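To see the maintenance cost concretely: suppose you later need table1.c as well. In form A you must edit two places; in form B, only one. A sketch with the question's names (an alias is added to the derived table, since Hive requires one):

```sql
-- Form A: c must be added in BOTH the subquery and the outer select
select t1.a, t1.b, t1.c, table2.aa, table2.bb
FROM (SELECT a, b, c FROM table1) t1
join table2 on (t1.b = table2.b);

-- Form B: c is added in one place only
select table1.a, table1.b, table1.c, table2.aa, table2.bb
FROM table1
join table2 on (table1.b = table2.b);
```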

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the EXPLAIN command. Teradata also has a visual explain, which is simple to learn from.
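For example, to see the plan Teradata chooses for either variant, prefix the query with EXPLAIN (value here stands in for the question's placeholder):

```sql
EXPLAIN
select * from tableA a
inner join tableB b on a.id = b.id
where b.columnX = value;
```

Running the same check on the subquery variant lets you verify that both compile to the same plan.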
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table to be built containing ALL fields of tableB.
It should be (not that I recommend this query style at all)
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue; then use the EXPLAIN commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
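A sketch of the same query as a CTE, using the names from the question:

```sql
WITH filtered_b AS (
    select id
    from tableB
    where columnX = value
)
select *
from tableA a
inner join filtered_b b on a.id = b.id;
```

The CTE gives the filter a name that can be tested on its own and reused elsewhere in the statement.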
Whenever you come across such a scenario where you wonder which query would yield results faster in Teradata, use the EXPLAIN plan, which will show how the PE is going to retrieve records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path used to resolve the query; you can't choose it directly. What you can do is declare indexes, so that the DBMS takes them into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by b.columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX: the DBMS will probably consider that index and choose an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
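For instance (index name hypothetical):

```sql
-- Lets the DBMS resolve the filter on columnX without a full table scan
CREATE INDEX ix_tableB_columnX ON tableB (columnX);
```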

How should I refactor subqueries

I am aware of three ways of writing subqueries into a query.
Normal subquery
With clause
Temporary table
Subqueries get extremely messy when there are multiples of them in a query, and especially when you have nested subqueries.
The WITH clause is my preference, but (I believe) you can only reference the subqueries in a WITH clause from the select statement that directly follows it.
Temporary tables are good, but they require quite a bit of overhead in declaring the table.
Are there any other ways to refactor subqueries other than these?
And are there any trade offs between them that I haven't considered?
You are leaving out some other capabilities.
The most obvious is views. If you have a complex query that is going to be used multiple times -- particularly one that might be implementing business rules between tables -- then a view is very useful. If performance is an issue, then you can materialize the view.
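A sketch, with hypothetical table and column names (Oracle syntax assumed for the materialized variant):

```sql
-- Encapsulate a reusable business-rule query as a view
CREATE VIEW active_orders AS
SELECT o.order_id, o.customer_id, o.total
FROM orders o
WHERE o.status = 'ACTIVE';

-- If performance becomes an issue, materialize it instead
CREATE MATERIALIZED VIEW active_orders_mv AS
SELECT o.order_id, o.customer_id, o.total
FROM orders o
WHERE o.status = 'ACTIVE';
```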
A common use of subqueries is to generate additional columns in a table -- such as the difference between two columns. You can use computed columns for these calculations, and make them part of the data definition.
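A sketch of a computed column (SQL Server syntax; table and column names hypothetical):

```sql
-- The difference is now part of the data definition, not a subquery
ALTER TABLE orders ADD profit AS (revenue - cost);
```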
Finally, you could implement user-defined functions. User-defined table-valued functions are a lot like views with parameters. This can be really helpful under some circumstances, and the underlying queries should generally be well optimized.
Another type of user-defined function is the scalar function. These usually incur more overhead, but can be quite useful at times.
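A sketch of an inline table-valued function (SQL Server syntax; names hypothetical):

```sql
-- Essentially a view with a parameter
CREATE FUNCTION dbo.OrdersForCustomer (@customer_id INT)
RETURNS TABLE
AS
RETURN
(
    SELECT order_id, total
    FROM orders
    WHERE customer_id = @customer_id
);
```

It can then be queried like a table: SELECT * FROM dbo.OrdersForCustomer(42).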
All that said, if you structure your queries cleanly, then subqueries and CTEs won't look "messy". You might actually find that you can write very long queries that make sense both to you and to other people reading them.
As a matter of preference and readability more than performance, WITH is probably the best.
I don't know which database you are using, but in Oracle the WITH clause creates a temporary view/table accessible by the name on the left-hand side of the AS, and it is not really distinct from a subquery: that name may be used as if it were a normal table.
The select * from (select * from a) form does the same; the only difference is that you cannot reuse that result:
select * from (subquery1) q left join t1 on t1.id = q.id
union all
select * from (subquery1) q left join t2 on t2.id = q.id;
But that is where the query plan is important: subquery1 is the same in both cases, and the plan may be one that uses a temporary table/view, thus reducing the overall cost.
The WITH clause is ultimately a way to create a temporary table/view, and it also forces the plan optimizer to build the query in a certain order, which may or may not be best.
A temporary table is good if you know the result will be reused later, outside the same query (within one query, WITH does the same work, given the temporary table it uses) or even across a transaction (for example, saving the result of a search):
begin
insert into tmp (...);
select * from tmp q left join t1 on t1.id = q.id;
select * from tmp q left join t2 on t2.id = q.id;
end;
The tmp table is used twice in the same transaction but not in the same query: your database won't compute the result twice, and this is probably fine if all you are doing is selects (no mutation of tmp's source).

Why do nested select statements take longer to process than temporary tables?

Forgive me if this is a repeat and/or obvious question, but I can't find a satisfactory answer either on stackoverflow or elsewhere online.
Using Microsoft SQL Server, I have a nested select query that looks like this:
select *
into FinalTable
from
(select * from RawTable1 join RawTable2)
join
(select * from RawTable3 join RawTable4)
Instead of using nested selects, the query can be written using temporary tables, like this:
select *
into Temp1
from RawTable1 join RawTable2
select *
into Temp2
from RawTable3 join RawTable4
select *
into FinalTable
from Temp1 join Temp2
Although equivalent, the second (non-nested) query runs several orders of magnitude faster than the first (nested) query. This is true both on my development server and on a client's server. Why?
The database engine holds subqueries in memory at execution time; since they are virtual and not physical, the optimiser can't select the best route, or at least not until a sort occurs in the plan. This also means the optimiser will be doing multiple full table scans on each operation rather than a possible index seek on a temporary table.
Consider each subquery to be a juggling ball: the more subqueries you give the db engine, the more things it's juggling at one time. If you simplify this in batches of code with a temp table, the optimiser finds a clear route, in most cases regardless of indexes, at least for more recent versions of SQL Server.
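A sketch of the temp-table pattern with an explicit index (SQL Server; the join columns are assumptions, since the question elides them):

```sql
-- Materialise the intermediate result so it gets real statistics
SELECT r1.id, r1.col1, r2.col2
INTO #Temp1
FROM RawTable1 r1
JOIN RawTable2 r2 ON r1.id = r2.id;

-- Give the optimiser an index to seek on in the final join
CREATE INDEX ix_temp1_id ON #Temp1 (id);
```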

Inner joins involving three tables

I have a SELECT statement with two inner joins involving two tables (one of them joined to itself).
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, are there other things I can do to optimize the joins, such as rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune the PostgreSQL config, run VACUUM ANALYZE, and apply all the general optimizations.
If this is not enough and you can spend a few days, you may write code to create a materialized view, as described in the PostgreSQL wiki.
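Note that since PostgreSQL 9.3 materialized views are built in; the wiki's hand-rolled approach applies to older versions. A sketch with the question's table names (the selected columns are assumptions):

```sql
CREATE MATERIALIZED VIEW joined_mv AS
SELECT t1.id, t3.id AS other_id      -- columns assumed
FROM my_table t1
JOIN other_table t3 ON t1.id = t3.id;

-- Re-run when the underlying data changes
REFRESH MATERIALIZED VIEW joined_mv;
```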
You likely have an error in your example: because you're selecting the same record from my_table twice, you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code t1 will always be t2.
But let's assume you meant ON t2.idX = t1.id; then, to answer your question, you can't get much better performance than what you have. You could index the columns, or go further and define them as foreign key relationships (which wouldn't add much performance benefit compared to simply indexing them).
You might instead look at restricting your WHERE clause; that is where your indexing would be as (if not more) beneficial.
You could rewrite the query using WHERE EXISTS (if you don't need to select data from all three tables) rather than INNER JOINs, but the performance will be almost identical (except when this is itself inside a nested query), as it still needs to find the records.
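A sketch of the EXISTS form, selecting only from my_table:

```sql
SELECT t1.*
FROM my_table AS t1
WHERE EXISTS (
    SELECT 1
    FROM other_table AS t3
    WHERE t3.id = t1.id
);
```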
In PostgreSQL. most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that sometimes queries can't be optimized themselves, or that they might not need to be, but this doesn't have any of the problem areas that I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The first thing to do is to run VACUUM ANALYZE to make sure you have optimal statistics. Then use EXPLAIN ANALYZE to compare expected query performance to actual. From that point, we'd look at indexes, etc. There isn't anything in this query that needs to be optimized at the query level. However, without looking at the actual filters in your WHERE clause and the actual output of EXPLAIN ANALYZE, there isn't much that can be suggested.
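Those two steps, in PostgreSQL syntax (the elided column list and filters are left as in the question):

```sql
VACUUM ANALYZE my_table;

EXPLAIN ANALYZE
SELECT ...
FROM my_table AS t1
JOIN other_table AS t3 ON t1.id = t3.id
WHERE ...;
```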
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.