Oracle performance questions, inner selects in joins, temporary WITH tables indexes - sql

I would like to consult three aspects of performance (Oracle 11g).
1./ If I define temporary table by keyword "WITH" like
WITH tbl AS (
SELECT [columns from both tables...]
FROM table_with_inexes
JOIN other_table ...
)
SELECT ...
FROM tbl
JOIN xxx ON tbl.column = xxx.column
is subsequent select on that temporary table able to use indexes, that was defined on table_with_inexes and other_table?
2./ Is it possible to add indexes to temporary table created by "WITH" in that above-like single SQL command?
3./ When I have construct such as this:
...
LEFT JOIN (
SELECT indexedColumn, otherColumns
FROM table
JOIN other_table
GROUP BY ...
) C
ON (outerTable.indexedColumn = C.indexedColumn)
in which cases could Oracle use indexes on indexedColumn? I assume, that the select in LEFT JOIN is only "projection" that does not maintain indexes, so the join's ON clausule evaluation is evaluated without using indexes?

The WITH clause (or subquery factoring as it's known as) is just a means of creating aliases for subqueries. It's most useful when you have multiple copies of the same subquery in your query, in which case Oracle may or may not choose to create a temporary table for it behind the scenes (aka "materialize" it). You should read up on this - here's a good link.
To answer your questions:
1) If the indexes are available to be used (no functions on the columns involved, selecting a small percentage of the data etc, etc) then they'll be used, just like in any other query.
2) You can't add indexes to the subquery. Not even to the temporary table that Oracle might create behind the scenes; you have no control over that.
3) I suggest you read up about when indexes might or might not be used. Try http://www.orafaq.com/node/1403 or http://www.orafaq.com/tuningguide/not%20using%20index.html, or perform your own google search.

WITH clause might be either inlined or materialized. It's up to Oracle to decide which approach is better. In your case most probably both queries will have the same execution plan(will be inlined)
PS: even if the table is materialized, indexes can not be added, Oracle can not do that. On the other hand in most cases it is not even necessary, the table can be materialized as a hash table(not heap table) or full table scan is used on it.

Related

SQL reduce data in join or where

I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA inner join (select * from tableB where b.columnX = value) b on a.id = b.id
I think makes sense to reduce the dataset from tableB in advanced, but I dont find anything to backup my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL system show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is simple to learn from
It depends on the data volumes and key type in each table, if any method would be advantageous
Most SQL compilers will work this out correctly using the current table statistics (data size and spread)
In some SQL systems your second command would be worse, as it may force a full temporary table build by ALL fields on tableB
It should be (not that I recommend this query style at all)
select * from tableA inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this, unless you have a specific performance issue, and then use the explain commands to work out why
A better way in general is to use common table expressions (CTE) to break the problem down. This leads to better queries that can be tested and maintain over the long term
Whenever you come across such scenarios wherein you feel that which query would yeild the results faster in teradata, please use the EXPLAIN plan in teradata - which would properly dictate how the PE is going to retrieve records. If you are using Teradata sql assistant then you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query, you can't decide it, but you can do certain things like declaring indexes so that the DBMS takes those indexes into consideration when deciding which access path it will use to resolve the query, and then you will get a better performance.
For instance, in this example you are filtering tableB by b.columnX, normally if there are no indexes declared for tableB the DBMS will have to do a full table scan to determine which rows fulfill that condition, but suppose you declare an index on tableB by columnX, in that case the DBMS will probably consider that index and determine an access path that makes use of the index, getting a much better performance than a full table scan, specially if the table is big.

How should I refactor subqueries

I am aware of three ways of writing subqueries into a query.
Normal subquery
With clause
Temporary table
Subqueries get extremely messy when there are multiples of them in a query, and especially when you have nested subqueries.
With clause is my preference, but you can only use the subqueries in a WITH clause in the select statement which directly follows the WITH clause (I believe).
Temporary tables are good, but they require quite a bit of over head in declaring the table.
Are there any other ways to refactor subqueries other than these?
And are there any trade offs between them that I haven't considered?
You are leaving out some other capabilities.
The most obvious is views. If you have a complex query that is going to be used multiple times -- particularly one that might be implementing business rules between tables -- then a view is very useful. If performance is an issue, then you can materialize the view.
A common use of subqueries is to generate additional columns in a table -- such as the difference between two columns. You can use computed columns for these calculations, and make them part of the data definition.
Finally, you could implement user-defined functions. user-defined table-valued functions are a lot like views with parameters. This can be really helpful under some circumstances. And the underlying queries should generally be quite optimized.
Another type of user-defined functions are scalar functions. These usually incur more overhead, but can be quite useful at times.
All that said, if you structure your queries cleanly, then subqueries and CTEs won't look "messy". You might actually find that you can write very long queries that make sense both to you and to other people reading them.
As a matter of preference and readability more than performance, with is probably the best.
I don't know which database you are using, but in Oracle the with create a temporary view/table accessible with the name on the LHS of the as and is not really distinct from a subquery: this name may be used like it were a normal table.
The select * from (select * from a) is doing the same: the only matter is that you can not reuse that result:
select * from (subquery1) q left join t1 on t1.id = q.id
union all
select * from (subquery1) q left join t2 on t2.id = q.id;
But that is where the query plan is important: subquery1 is the same in both case and the plan may be one that use a temporary table/view, thus reducing the cost of whole.
The with is ultimately a way to create temporary table/view and also force the plan optimizer to build query in some order which may (not) be best.
Temporary table would be good if you know the result would be reused later, not in the same query (in which case the with does the same work, given the temporary table it use) and even transaction (example: saving the result of a search):
begin
insert into tmp (...);
select * from tmp q left join t1 on t1.id = q.id;
select * from tmp q left join t2 on t2.id = q.id;
end;
The tmp table is used twice in the same transaction but not in the same query: your database won't recompute the result twice and it is probably fine if all you are doing are select (no mutation on tmp source).

Clarification about Select from (select...) statement

I came across a SQL practice question. The revealed answer is
SELECT ROUND(ABS(a - c) + ABS(b - d), 4) FROM (
SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d
FROM station);
Normally, I would enocunter
select[] from[] where [] (select...)
which to imply that the selected variable from the inner loop at the where clause will determine what is to be queried in the outer loop. As mentioned at the beginning, this time the select is after
FROM to me I'm curious the functionality of this. Is it creating an imaginary table?
The piece in parentheses:
(SELECT MIN(lat_n) AS a, MIN(long_w) AS b, MAX(lat_n) AS c, MAX(long_w) AS d FROM station)
is a subquery.
What's important here is that the result of a subquery looks like a regular table to the outer query. In some SQL flavors, an alias is necessary immediately following the closing parenthesis (i.e. a name by which to refer to the table-like result).
Whether this is technically a "temporary table" is a bit of a detail as its result isn't stored outside the scope of the query; and there is an also a thing called a temporary table which is stored.
Additionally (and this might be the source of confusion), subqueries can also be used in the WHERE clause with an operator (e.g. IN) like this:
SELECT student_name
FROM students
WHERE student_school IN (SEELCT school_name FROM schools WHERE location='Springfield')
This is, as discussed in the comments and the other answer a subquery.
Logically, such a subquery (when it appears in the FROM clause) is executed "first", and then the results treated as a table1. Importantly though, that is not required by the SQL language2. The entire query (including any subqueries) is optimized as a whole.
This can include the optimizer doing things like pushing a predicate from the outer WHERE clause (which, admittedly, your query doesn't have one) down into the subquery, if it's better to evaluate that predicate earlier rather than later.
Similarly, if you had two subqueries in your query that both access the same base table, that does not necessarily mean that the database system will actually query that base table exactly twice.
In any case, whether the database system chooses to materialize the results (store them somewhere) is also decided during the optimization phase. So without knowing your exact RDBMS and the decisions that the optimizer takes to optimize this particular query, it's impossible to say whether it will result in something actually being stored.
1Note that there is no standard terminology for this "result set as a table" produced by a subquery. Some people have mentioned "temporary tables" but since that is a term with a specific meaning in SQL, I shall not be using it here. I generally use the term "result set" to describe any set of data consisting of both columns and rows. This can be used both as a description of the result of the overall query and to describe smaller sections within a query.
2Provided that the final results are the same "as if" the query had been executed in its logical processing order, implementations are free to perform processing in any ordering they choose to.
As there are so many terms involved, I just thought I'll throw in another answer ...
In a relational database we deal with tables. A query reads from tables and its result again is a table (albeit not a stored one).
So in the FROM clause we can access query results just like any stored table:
select * from (select * from t) x;
This makes the inner query a subquery to our main query. We could also call this an ad-hoc view, because view is the word we use for queries we access data from. We can move it to the begin of our main query in order to enhance readability and possibly use it multiple times in it:
with x as (select * from t) select * from x;
We can even store such queries for later access:
create view v as select * from t;
select * from v;
In the SQL standard these terms are used:
BASE TABLE is a stored table we create with CREATE TABLE .... t in above examples is supposed to be a base table.
VIEWED TABLE is a view we create with CREATE VIEW .... v above examples is a viewed table.
DERIVED TABLE is an ad-hoc view, such as x in the examples above.
When using subqueries in other clauses than FROM (e.g. in the SELECT clause or the WHERE clause), we don't use the term "derived table". This is because in these clauses we don't access tables (i.e. something like WHERE mytable = ... does not exist), but columns and expression results. So the term "subquery" is more general than the term "derived table". In those clauses we still use various terms for subqueries, though. There are correlated and non-correlated subqueries and scalar and non-scalar ones.
And to make things even more complicated we can use correlated subqueries in the FROM clause in modern DBMS that feature lateral joins (sometimes implemented as CROSS APPLY and OUTER APPLY). The standard calls these LATERAL DERIVED TABLES.

Index spanning multiple tables in PostgreSQL

Is it possible in PostgreSQL to place an index on an expression containing fields of multiple tables. So for example an index to speed up an query of the following form:
SELECT *, (table1.x + table2.x) AS z
FROM table1
INNER JOIN table2
ON table1.id = table2.id
ORDER BY z ASC
No it's not possible to have an index on many tables, also it really wouldn't guarantee speeding up anything since you won't always get an Index Only Scan. What you really want is a materialized view but pg doesn't have those either. You can try implementing it yourself using triggers like this or this.
Update
As noted by #petter. The materialized views were introduced in 9.3.
No, that's not possible in any currently shipping SQL dbms. Oracle supports bitmap join indexes, but that might not be relevant. It's not clear to me whether you want an index on only the join columns of multiple tables, or whether you want an index on arbitrary columns of joined tables.
To determine the real source of performance problems, learn to read the output of PostgreSQL's EXPLAIN ANALYZE.

Creating Indexes for Group By Fields?

Do you need to create an index for fields of group by fields in an Oracle database?
For example:
select *
from some_table
where field_one is not null and field_two = ?
group by field_three, field_four, field_five
I was testing the indexes I created for the above and the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
It could be correct, but that would depend on how much data you have. Typically I would create an index for the columns I was using in a GROUP BY, but in your case the optimizer may have decided that after using the field_two index that there wouldn't be enough data returned to justify using the other index for the GROUP BY.
No, this can be incorrect.
If you have a large table, Oracle can prefer deriving the fields from the indexes rather than from the table, even there is no single index that covers all values.
In the latest article in my blog:
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle
, there is a query in which Oracle does not use full table scan but rather joins two indexes to get the column values:
SELECT l.id, l.value
FROM t_left l
WHERE NOT EXISTS
(
SELECT value
FROM t_right r
WHERE r.value = l.value
)
The plan is:
SELECT STATEMENT
HASH JOIN ANTI
VIEW , 20090917_anti.index$_join$_001
HASH JOIN
INDEX FAST FULL SCAN, 20090917_anti.PK_LEFT_ID
INDEX FAST FULL SCAN, 20090917_anti.IX_LEFT_VALUE
INDEX FAST FULL SCAN, 20090917_anti.IX_RIGHT_VALUE
As you can see, there is no TABLE SCAN on t_left here.
Instead, Oracle takes the indexes on id and value, joins them on rowid and gets the (id, value) pairs from the join result.
Now, to your query:
SELECT *
FROM some_table
WHERE field_one is not null and field_two = ?
GROUP BY
field_three, field_four, field_five
First, it will not compile, since you are selecting * from a table with a GROUP BY clause.
You need to replace * with expressions based on the grouping columns and aggregates of the non-grouping columns.
You will most probably benefit from the following index:
CREATE INDEX ix_sometable_23451 ON some_table (field_two, field_three, field_four, field_five, field_one)
, since it will contain everything for both filtering on field_two, sorting on field_three, field_four, field_five (useful for GROUP BY) and making sure that field_one is NOT NULL.
Do you need to create an index for fields of group by fields in an Oracle database?
No. You don't need to, in the sense that a query will run irrespective of whether any indexes exist or not. Indexes are provided to improve query performance.
It can, however, help; but I'd hesitate to add an index just to help one query, without thinking about the possible impact of the new index on the database.
...the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
Not always. Often a GROUP BY will require Oracle to perform a sort (but not always); and you can eliminate the sort operation by providing a suitable index on the column(s) to be sorted.
Whether you actually need to worry about the GROUP BY performance, however, is an important question for you to think about.