DB2 join optimization with subqueries (NLJOIN vs HSJOIN)

I'm trying to improve the performance of a query I use to count data. On one of the queries, the optimizer of DB2 LUW chooses to do a nested loop join instead of a hash join.
The problematic query (Resulting in NLJOIN)
with
source1 as (select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') as LOGICAL_KEY from "SAMPLE"."SOURCE1"),
source2 as (select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') as LOGICAL_KEY from "SAMPLE"."SOURCE2")
select count(*) from (
SELECT
"a"."LOGICAL_KEY",
"b"."LOGICAL_KEY"
FROM
source1 "a",
source2 "b"
WHERE
"a"."LOGICAL_KEY" =
"b"."LOGICAL_KEY"
);
However, when I create tables from the subqueries first, the optimizer performs a hash join.
The optimized Query (Resulting in HSJOIN)
CREATE TABLE "SAMPLE"."TMP_SOURCE1" ("LOGICAL_KEY" VARCHAR(4000 BYTE));
CREATE TABLE "SAMPLE"."TMP_SOURCE2" ("LOGICAL_KEY" VARCHAR(4000 BYTE));
insert into "SAMPLE"."TMP_SOURCE1" select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') LOGICAL_KEY from "SAMPLE"."SOURCE1";
insert into "SAMPLE"."TMP_SOURCE2" select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') LOGICAL_KEY from "SAMPLE"."SOURCE2";
select count(*) from (
SELECT
"a"."LOGICAL_KEY",
"b"."LOGICAL_KEY"
FROM
(select * from "SAMPLE"."TMP_SOURCE1") "a",
(select * from "SAMPLE"."TMP_SOURCE2") "b"
WHERE
"a"."LOGICAL_KEY" =
"b"."LOGICAL_KEY"
);
Why does DB2 choose different paths for the same data? Is there any way to force the structure with a subquery to perform a hash join? Due to privilege restrictions I'm not able to create tables.

What is the db2level in your setup? If it is DB2 v9 or older, then one reason why the optimizer did not even consider HSJOIN is that the problematic query contains a join predicate on an expression. In this case, an optimization profile is not an option. Use of session tables is a good idea.
From DB2 v10 onwards, HSJOIN also works for joins on expressions, so the optimizer starts considering it. But the final decision is still cost-based: the plan that is estimated to be cheaper wins.
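Since you can't create permanent tables, a declared global temporary table is the usual way to get that session-table behavior; it needs only USE privilege on a user temporary tablespace, not CREATETAB. A minimal sketch, reusing the names from the question:
DECLARE GLOBAL TEMPORARY TABLE TMP_SOURCE1
  (LOGICAL_KEY VARCHAR(4000))
  ON COMMIT PRESERVE ROWS NOT LOGGED;
INSERT INTO SESSION.TMP_SOURCE1
  SELECT COALESCE(CAST("LOGICAL_KEY" AS VARCHAR(4000)), ']#[')
  FROM "SAMPLE"."SOURCE1";
-- repeat for SOURCE2, then join SESSION.TMP_SOURCE1 to SESSION.TMP_SOURCE2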
References:
v97 info
v10 info

Fundamentally you are not comparing apples to apples. In your first scenario, building tables a and b is done as part of the query, and the optimiser includes this work in its whole join strategy. Since it has to build the tables in memory anyway, an NLJOIN scanning them is not much extra work. In the second scenario the optimiser considers a join between two existing tables with an equivalence predicate and finds that HSJOIN is the way to go.
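To see which operator the optimiser actually picked in each case, you can capture and format the access plan (a sketch; it assumes the explain tables already exist, e.g. created from the EXPLAIN.DDL script that ships with DB2):
EXPLAIN PLAN FOR
<the problematic query from above>;
-- then format it from a shell and look for NLJOIN vs HSJOIN operators:
-- db2exfmt -d SAMPLE -1 -o plan.txt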

Related

SQL reduce data in join or where

I want to know which is faster, assuming I have the following queries and they retrieve the same data:
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about flattening subqueries, using indexes and partitions based on WHERE clauses, and so on.
Most SQL engines compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the EXPLAIN command. Teradata has a visual explain too, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build with ALL fields of tableB.
It should be (not that I recommend this query style at all)
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this, unless you have a specific performance issue, and then use the explain commands to work out why
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term, as sketched below.
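For example, a minimal CTE version of the query in question (same hypothetical tableA/tableB as above):
with filtered_b as (
  select id
  from tableB
  where columnX = value
)
select a.*
from tableA a
inner join filtered_b b on a.id = b.id;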
Whenever you come across such scenarios in Teradata and wonder which query would yield the results faster, use the EXPLAIN plan, which shows how the PE (Parsing Engine) is going to retrieve records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query; you can't decide it directly, but you can do certain things, like declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by b.columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX; in that case the DBMS will probably consider that index and determine an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
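A sketch of such an index (standard syntax with a hypothetical name; in Teradata the equivalent secondary index would be CREATE INDEX (columnX) ON tableB):
-- hypothetical index supporting the filter on columnX
create index ix_tableB_columnX on tableB (columnX);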

How should I refactor subqueries

I am aware of three ways of writing subqueries into a query.
Normal subquery
With clause
Temporary table
Subqueries get extremely messy when there are multiples of them in a query, and especially when you have nested subqueries.
The WITH clause is my preference, but you can only use the subqueries defined in a WITH clause in the select statement that directly follows it (I believe).
Temporary tables are good, but they require quite a bit of overhead in declaring the table.
Are there any other ways to refactor subqueries other than these?
And are there any trade-offs between them that I haven't considered?
You are leaving out some other capabilities.
The most obvious is views. If you have a complex query that is going to be used multiple times -- particularly one that might be implementing business rules between tables -- then a view is very useful. If performance is an issue, then you can materialize the view.
A common use of subqueries is to generate additional columns in a table -- such as the difference between two columns. You can use computed columns for these calculations, and make them part of the data definition.
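Quick sketches of both ideas (SQL Server syntax assumed; all names are hypothetical):
-- a view encapsulating a business rule between tables
CREATE VIEW active_customer_orders AS
SELECT o.order_id, o.total, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.status = 'active';

-- a computed column: the difference becomes part of the table definition
-- (run in a separate batch from the CREATE VIEW)
ALTER TABLE orders ADD profit AS (total - cost);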
Finally, you could implement user-defined functions. User-defined table-valued functions are a lot like views with parameters. This can be really helpful under some circumstances, and the underlying queries should generally be quite optimized.
Another type of user-defined function is the scalar function. These usually incur more overhead, but can be quite useful at times.
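A minimal inline table-valued function in the same spirit (again SQL Server syntax, hypothetical names):
-- behaves like a parameterized view; the optimizer expands it inline
CREATE FUNCTION dbo.orders_for (@customer_id int)
RETURNS TABLE
AS RETURN
(
    SELECT order_id, total
    FROM orders
    WHERE customer_id = @customer_id
);
-- usage: SELECT * FROM dbo.orders_for(42);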
All that said, if you structure your queries cleanly, then subqueries and CTEs won't look "messy". You might actually find that you can write very long queries that make sense both to you and to other people reading them.
As a matter of preference and readability more than performance, WITH is probably the best.
I don't know which database you are using, but in Oracle the WITH creates a temporary view/table accessible by the name on the LHS of the AS, and is not really distinct from a subquery: that name may be used as if it were a normal table.
The select * from (select * from a) does the same; the only issue is that you cannot reuse that result:
select * from (subquery1) q left join t1 on t1.id = q.id
union all
select * from (subquery1) q left join t2 on t2.id = q.id;
But that is where the query plan is important: subquery1 is the same in both cases, and the plan may be one that uses a temporary table/view, thus reducing the cost of the whole.
The WITH is ultimately a way to create a temporary table/view and also to force the plan optimizer to build the query in some order, which may (or may not) be best.
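For instance, the UNION ALL above can name the result once; whether it is actually computed once is the optimizer's call, though in Oracle the (undocumented but widely used) MATERIALIZE hint can force the temporary-table behavior:
with q as (
  select /*+ MATERIALIZE */ t.*
  from (subquery1) t
)
select * from q left join t1 on t1.id = q.id
union all
select * from q left join t2 on t2.id = q.id;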
A temporary table would be good if you know the result will be reused later, not in the same query (in which case the WITH does the same work, given the temporary table it uses), but even across the transaction (example: saving the result of a search):
begin
insert into tmp (...);
select * from tmp q left join t1 on t1.id = q.id;
select * from tmp q left join t2 on t2.id = q.id;
end;
The tmp table is used twice in the same transaction but not in the same query: your database won't recompute the result twice, and it is probably fine if all you are doing is selects (no mutation of tmp's source).
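If you go that route in Oracle, the scratch table is typically a global temporary table created once up front (a sketch; the columns are hypothetical):
-- rows are private to each session; DELETE ROWS empties the table at commit,
-- PRESERVE ROWS would keep them for the whole session
create global temporary table tmp (
  id number
  -- ... other columns as needed
) on commit delete rows;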

Teradata: use of aliases impacts EXPLAIN estimation of time

I have a relatively simple query:
SELECT
  db1.something
, COALESCE(db2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
EXPLAIN gives an estimated time of something more than 15 seconds.
On the other hand, explain on the following, where we basically replaced the alias with the table name:
SELECT
  db1.something
, COALESCE(db_2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
gives an estimated time of over 4 hours, where it seems like the system is trying to execute a product join on some spool (I can't really follow the sequence of planning steps).
I always thought that aliases are just aliases and have no impact on performance.
The estimated time is probably correct :-)
A table alias is not really an alias; it replaces the table name within that query. In Teradata, using the original table name doesn't result in an error message (as it does in most other DBMSes), but it causes a CROSS join.
Why? Well, Teradata was implemented before there was Standard SQL; the initial query language was called TEQUEL (TEradata QUEry Language), whose syntax didn't require listing tables within FROM. A simple RETRIEVE TableName.ColumnName carried enough information for the Parser/Optimizer to resolve the table name and column name. There's no flag to switch it off; some client tools refuse to submit it, but you can still submit RETRIEVE in BTEQ.
Within the above example you're mixing old TEQUEL and SQL: there are 3 tables for the optimizer, but only one join condition, and this results in a CROSS join to the third table.
At least it's easy to spot in Explain. The optimizer will do this stupid join as last step, so scroll to the end and you will see joined using a product join, with a join condition of ("(1=1)").
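To make the mix-up concrete: in the second query from the question, the optimizer effectively sees three table references, only two of which are joined (a sketch of the parsed form):
-- db1, db2 (the alias) and db_2 (the stray original name) are three
-- table references; db_2 has no join condition at all
SELECT db1.something,
       COALESCE(db_2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id;
-- in the Explain: "... joined using a product join, with a join condition of ("(1=1)")"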

Oracle performance questions, inner selects in joins, temporary WITH tables indexes

I would like to ask about three aspects of performance (Oracle 11g).
1./ If I define a temporary table with the keyword WITH, like
WITH tbl AS (
SELECT [columns from both tables...]
FROM table_with_indexes
JOIN other_table ...
)
SELECT ...
FROM tbl
JOIN xxx ON tbl.column = xxx.column
is a subsequent select on that temporary table able to use the indexes that were defined on table_with_indexes and other_table?
2./ Is it possible to add indexes to the temporary table created by WITH within that single SQL command?
3./ When I have a construct such as this:
...
LEFT JOIN (
SELECT indexedColumn, otherColumns
FROM table
JOIN other_table
GROUP BY ...
) C
ON (outerTable.indexedColumn = C.indexedColumn)
in which cases could Oracle use indexes on indexedColumn? I assume that the select in the LEFT JOIN is only a "projection" that does not maintain indexes, so the join's ON clause is evaluated without using indexes?
The WITH clause (or subquery factoring, as it's known) is just a means of creating aliases for subqueries. It's most useful when you have multiple copies of the same subquery in your query, in which case Oracle may or may not choose to create a temporary table for it behind the scenes (aka "materialize" it). You should read up on this - here's a good link.
To answer your questions:
1) If the indexes are available to be used (no functions on the columns involved, selecting a small percentage of the data, etc.) then they'll be used, just like in any other query.
2) You can't add indexes to the subquery. Not even to the temporary table that Oracle might create behind the scenes; you have no control over that.
3) I suggest you read up about when indexes might or might not be used. Try http://www.orafaq.com/node/1403 or http://www.orafaq.com/tuningguide/not%20using%20index.html, or perform your own google search.
A WITH clause might be either inlined or materialized; it's up to Oracle to decide which approach is better. In your case most probably both queries will have the same execution plan (the subquery will be inlined).
PS: even if the table is materialized, indexes cannot be added; Oracle cannot do that. On the other hand, in most cases it is not even necessary: the table can be materialized as a hash table (not a heap table), or a full table scan is used on it.
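If you do want to steer that choice, the (undocumented but commonly used) INLINE and MATERIALIZE hints are the usual knobs; a sketch against the query from question 1 (the join condition is hypothetical):
with tbl as (
  select /*+ MATERIALIZE */ t.*    -- or /*+ INLINE */ to have it merged
  from table_with_indexes t
  join other_table o on o.id = t.id
)
select *
from tbl
join xxx on tbl.column = xxx.column;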

Why do nested select statements take longer to process than temporary tables?

Forgive me if this is a repeat and/or obvious question, but I can't find a satisfactory answer either on stackoverflow or elsewhere online.
Using Microsoft SQL Server, I have a nested select query that looks like this:
select *
into FinalTable
from
(select * from RawTable1 join RawTable2)
join
(select * from RawTable3 join RawTable4)
Instead of using nested selects, the query can be written using temporary tables, like this:
select *
into Temp1
from RawTable1 join RawTable2
select *
into Temp2
from RawTable3 join RawTable4
select *
into FinalTable
from Temp1 join Temp2
Although equivalent, the second (non-nested) query runs several orders of magnitude faster than the first (nested) query. This is true both on my development server and on a client's server. Why?
The database engine holds subqueries in requisite memory at execution time. Since they are virtual and not physical, the optimiser can't select the best route, or at least not until a sort in the plan. It also means the optimiser will be doing multiple full table scans on each operation, rather than a possible index seek on a temporary table.
Consider each subquery to be a juggling ball. The more subqueries you give the db engine, the more things it's juggling at one time. If you simplify this into batches of code with a temp table, the optimiser finds a clear route, in most cases regardless of indexes, at least for more recent versions of SQL Server.
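A sketch of that batching on the question's tables (the join columns are hypothetical):
-- materialize the first intermediate result...
SELECT r1.id, r1.a, r2.b
INTO #Temp1
FROM RawTable1 AS r1
JOIN RawTable2 AS r2 ON r2.id = r1.id;

-- ...then give the optimizer something it can seek on
CREATE CLUSTERED INDEX ix_temp1_id ON #Temp1 (id);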