Teradata: use of aliases impacts EXPLAIN estimation of time

I have a relatively simple query:
SELECT
  db1.something
, COALESCE(db2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
EXPLAIN gives an estimated time of something more than 15 seconds.
On the other hand, EXPLAIN on the following, where we basically replaced the alias with the table name:
SELECT
  db1.something
, COALESCE(db_2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
gives an estimated time of over 4 hours, where it seems like the system is trying to execute a product join on some spool (I can't really follow the sequence of planning steps).
I always thought that aliases are just aliases and have no impact on performance.

The estimated time is probably correct :-)
A table alias is not really an alias; it replaces the table name within that query. In Teradata, using the original table name anyway doesn't result in an error message (as it does in most other DBMSes), but it causes a CROSS join.
Why? Well, Teradata was implemented before there was Standard SQL; the initial query language was called TEQUEL (TEradata QUEry Language), whose syntax didn't require listing tables in a FROM clause. A simple RETRIEVE TableName.ColumnName carried enough information for the Parser/Optimizer to resolve the table name and the column name. There's no flag to switch this behaviour off; some client tools refuse to submit it, but you can still submit RETRIEVE in BTEQ.
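For illustration, such a statement stands on its own, with no FROM clause at all (a sketch based on the description above, reusing the question's table; not verified against a live system):
RETRIEVE dwh.db_2.something_else;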
In the example above you're mixing old TEQUEL and SQL: the optimizer sees 3 tables but only one join condition, and this results in a CROSS join to the third table.
At least it's easy to spot in Explain. The optimizer will do this stupid join as the last step, so scroll to the end and you will see joined using a product join, with a join condition of ("(1=1)").
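To make that concrete, the slow second query behaves roughly as if an unaliased copy of dwh.db_2 had been added to the FROM list (a sketch of the effective query, not actual optimizer output):
SELECT
  db1.something
, COALESCE(dwh.db_2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
CROSS JOIN dwh.db_2    -- implicitly added: no join condition, hence ("(1=1)")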

Related

SQL reduce data in join or where

I want to know which is faster, assuming I have the following queries and they retrieve the same data:
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility, and most query engines are quite intelligent about flattening away subquery boundaries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is simple to learn from.
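For example, in Teradata you just prefix the query under test (using the hypothetical names from the question):
EXPLAIN
SELECT *
FROM tableA a
INNER JOIN tableB b ON a.id = b.id
WHERE b.columnX = value;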
Whether either method is advantageous depends on the data volumes and the key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second command would be worse, as it may force a full temporary table build of ALL fields from tableB.
It should be (not that I recommend this query style at all):
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue; then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
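For instance, the same query expressed with a CTE might look like this (a sketch using the question's hypothetical names):
with filtered_b as (
    select id
    from tableB
    where columnX = value
)
select a.*
from tableA a
inner join filtered_b b on a.id = b.id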
Whenever you come across such a scenario and wonder which query would yield results faster in Teradata, use the EXPLAIN plan, which shows exactly how the PE (Parsing Engine) is going to retrieve the records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query; you can't choose it directly, but you can do certain things, like declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX: in that case the DBMS will probably consider that index and choose an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
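In most DBMSes, declaring such an index looks something like this (the index name is made up for illustration):
create index ix_tableB_columnX on tableB (columnX);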

Poor performance with stacked joins

I'm not sure I can provide enough details for an answer, but my company is having a performance issue with an older mssql view. I've narrowed it down to the right outer joins, but I'm not familiar with the structure of joins following joins without a "ON" with each one, as in the code snippet below.
How do I rewrite the joins below, either to improve performance or into the simpler JOIN TableName ON Field1 = Field2 format?
FROM dbo.tblObject AS tblObject_2
JOIN dbo.tblProspectB2B PB ON PB.Object_ID = tblObject_2.Object_ID
RIGHT OUTER JOIN dbo.tblProspectB2B_CoordinatorStatus
RIGHT OUTER JOIN dbo.tblObject
INNER JOIN dbo.vwDomain_Hierarchy
INNER JOIN dbo.tblContactUser
INNER JOIN dbo.tblProcessingFile WITH ( NOLOCK )
LEFT OUTER JOIN dbo.enumRetentionRealization AS RR ON RR.RetentionRealizationID = dbo.tblProcessingFile.RetentionLeadTypeID
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1 WITH ( NOLOCK ) ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID
ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID
ON dbo.tblContactUser.Object_ID = dbo.tblLoan.ContactOwnerID
ON dbo.vwDomain_Hierarchy.Object_ID = tblObject_1.Domain_ID
ON dbo.tblObject.Object_ID = dbo.tblLoan.ContactOwnerID
ON dbo.tblProspectB2B_CoordinatorStatus.Object_ID = dbo.tblLoan.ReferralSourceContactID
ON tblObject_2.Object_ID = dbo.tblLoan.ReferralSourceContactID
Your last INNER JOIN has a number of ON clauses stacked after it. Per this question and answer, such syntax is equivalent to a nested join, as sketched below.
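To see how the stacking works, take just three of the tables from the snippet (a sketch, not a rewrite of the whole view). Each deferred ON binds to the most recent JOIN that is still missing one:
FROM dbo.tblProcessingFile
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1
ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID          -- closes the tblLoan/tblObject_1 join
ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID  -- closes the tblProcessingFile/tblLoan join
For INNER joins this is equivalent to the flat form:
FROM dbo.tblProcessingFile
INNER JOIN dbo.tblLoan ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID
INNER JOIN dbo.tblObject AS tblObject_1 ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID
With OUTER joins the nesting can change the results, which is why the view should be unpicked carefully (as the answer below describes) rather than mechanically reordered.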
That is one of the worst queries I have ever seen. Since I cannot figure out how it is supposed to work without the underlying data, this is what I suggest to you.
First find a good sample loan and write a query against this view to return rows where loan_id = ... Now you have a data set you can check your changes against more easily than the possibly millions of records the full view returns. Make sure these results make sense (that right join to tblObject is bothering me, as it makes no sense to return all the object records).
Now start writing your query with what you think should be the first table (I would suggest that loan is the first table; if it is not, then the first table is Object left joined to loan) and the where clause for the loan id.
Check your results: did you get the same loan information as the view query with the where clause added?
Then add each join one at a time and see how it affects the query and whether the results appear to be going off track. Once you have figured out a query that gives the same results with all the tables added in, try several other loan ids to check. Once those have checked out, run the whole query with no where clause and check against the view results (if it is a large number of rows, you may need to just see whether the record counts match and spot-check visually; use order by on both queries to make sure the results are in the same order). In the process, try to use only left joins and not that combination of right and left joins (it's OK to leave the inner ones alone).
I make it a habit in complex queries to do all the inner joins first and then the left joins. I never use right joins in production code.
Now you are ready to performance tune.
I would guess the right join to objects is causing a problem in that it returns the whole table, and the nature of that table name and the other joins to the same table lead me to believe that the author probably wanted a left join. Without knowing the meaning of the data, it is hard to be sure. So first, if you are returning too many records for one loan id, consider whether the real problem is that, as the tables have grown, returning too many records has become problematic.
Also consider that you can often take the view and replace it with code to get the same results. Views calling views are a poor technique that often leads to performance issues. Often the views on top of the other views call the same tables, and thus you end up joining to them multiple times when you don't need to.
Check your Explain plan or Execution plan depending on what database backend you have. Analysis of this should show where you might have missing indexes.
Also make sure that every table in the query is needed. This is especially true when you join to a view. The view may join to 12 other tables when you only need the data from one of them, and that one could join directly to one of your tables. Make sure that you are not using select * but only returning the fields actually needed. You have inner joins, so, by definition, select * is returning fields you don't need.
If the select part of the view has a distinct in it, consider whether you can weed out the multiple records that made distinct necessary, by changing to a derived table or adding a where clause. To see what is causing the multiples, you may need to temporarily use select * to see all the columns and find out which one is not unique and is causing the issue.
This whole process is not going to be easy or fun. Just take it slowly, work carefully and methodically and you will get there and have a query that is understandable and maintainable in the end.

DB2 join optimization with subqueries (NLJOIN vs HSJOIN)

I'm trying to fix the performance of a query I use to count data. On one of the queries the optimizer of DB2 LUW chooses to do a nested loop join instead of a hash join.
The problematic query (Resulting in NLJOIN)
with
source1 as (select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') as LOGICAL_KEY from "SAMPLE"."SOURCE1"),
source2 as (select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') as LOGICAL_KEY from "SAMPLE"."SOURCE2")
select count(*) from (
SELECT
"a"."LOGICAL_KEY",
"b"."LOGICAL_KEY"
FROM
source1 "a",
source2 "b"
WHERE
"a"."LOGICAL_KEY" =
"b"."LOGICAL_KEY"
);
However, when I create tables from the subqueries first, the optimizer performs a hash join.
The optimized query (Resulting in HSJOIN)
CREATE TABLE "SAMPLE"."TMP_SOURCE1" ("LOGICAL_KEY" VARCHAR(4000 BYTE));
CREATE TABLE "SAMPLE"."TMP_SOURCE2" ("LOGICAL_KEY" VARCHAR(4000 BYTE));
insert into "SAMPLE"."TMP_SOURCE1" select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') LOGICAL_KEY from "SAMPLE"."SOURCE1";
insert into "SAMPLE"."TMP_SOURCE2" select COALESCE(CAST("LOGICAL_KEY" AS CHARACTER VARYING(4000)), ']#[') LOGICAL_KEY from "SAMPLE"."SOURCE2";
select count(*) from (
SELECT
"a"."LOGICAL_KEY",
"b"."LOGICAL_KEY"
FROM
(select * from "SAMPLE"."TMP_SOURCE1") "a",
(select * from "SAMPLE"."TMP_SOURCE2") "b"
WHERE
"a"."LOGICAL_KEY" =
"b"."LOGICAL_KEY"
);
How come DB2 chooses different paths for the same data? Is there any way to force the structure with a subquery to perform a hash join? Due to privilege restrictions I'm not able to create tables.
What is the db2level in your setup? If it is DB2 v9 or older, then one reason why the optimizer did not even consider HSJOIN is that the problematic query contains a join predicate on an expression. In this case, an optimization profile is not an option. Use of session tables is a good idea (see the sketch after the references).
From DB2 v10, HSJOIN also works for joins on expressions, so the optimizer starts considering it. But the final decision is still cost-based: the plan that is estimated to be cheaper wins.
References:
v97 info
v10 info
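To expand on the session-table suggestion: a declared global temporary table needs only USE of a user temporary tablespace, not CREATE TABLE privileges. A sketch, assuming such a tablespace exists:
DECLARE GLOBAL TEMPORARY TABLE SESSION.TMP_SOURCE1
  (LOGICAL_KEY VARCHAR(4000))
  ON COMMIT PRESERVE ROWS NOT LOGGED;
INSERT INTO SESSION.TMP_SOURCE1
SELECT COALESCE(CAST("LOGICAL_KEY" AS VARCHAR(4000)), ']#[')
FROM "SAMPLE"."SOURCE1";
-- repeat for SOURCE2, then join SESSION.TMP_SOURCE1 to SESSION.TMP_SOURCE2 as in the HSJOIN variant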
Fundamentally you are not comparing apples with apples. In your first scenario, building tables a and b is done as part of the query, and the optimiser includes this work in its whole join strategy. Since it has to build the tables in memory anyway, an NLJOIN scanning them is not too much extra work. In the second scenario the optimiser considers a join between two existing tables with an equivalence predicate and finds that HSJOIN is the way to go.

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't change the results, but does it affect performance?
One can assume that the data in TABLE_C is much larger than in A or B.
For each SQL statement, the engine will create a query plan, so no matter how you order the tables, the engine will choose a correct path to build the query.
More on query plans: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the join order and plan using hints, if you feel that the engine is not choosing the correct path.
Sometimes the order of the tables does make a difference here (when you are mixing different join types).
Joins are conceptually built on cross products, and a chain of joins is evaluated left to right: a query of the form A JOIN B JOIN C is treated as (A JOIN B) JOIN C.
That means the result of joining A and B is produced first, and that intermediate result is then joined with table C.
So if inner joining A (100 records) and B (200 records) yields 100 records, those 100 records are then compared with the 1,000 records of C.
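Written out explicitly, with the hypothetical row counts as comments:
SELECT *
FROM (TABLE_A a
      JOIN TABLE_B b ON a.propertyA = b.propertyA)  -- 100 x 200 rows in, 100 rows out
JOIN TABLE_C c ON b.propertyB = c.propertyB         -- those 100 rows matched against C's 1,000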
No.
Well, there is a very, very tiny chance of this happening; see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening to anybody, in real life. You don't need to worry about it.

SQL Server 2005 query plan optimizer choking on date partitioned tables

We have TABLE A, partitioned by date, which does not contain data from today; it only contains data from the prior day back through year to date.
We have TABLE B, also partitioned by date, which does contain data from today as well as data from the prior day back through year to date. On top of TABLE B there is a view, View_B, which joins against View_C and View_D and left outer joins TABLE E. View_C and View_D are each selects from one table and do not have any other tables joined in. So View_B looks something like:
SELECT b.Foo, c.cItem, d.dItem, e.eItem
FROM TABLE_B b JOIN View_C c ON c.cItem = b.cItem
JOIN View_D d ON b.dItem = d.dItem
LEFT OUTER JOIN TABLE_E e ON b.eItem = e.eItem
View_AB joins TABLE A and View_B on extract date as well as one other constraint. So it looks something like:
SELECT a.Col_1, b.Col_2, ...
FROM TABLE_A a LEFT OUTER JOIN View_B b
on a.ExtractDate = b.ExtractDate and a.Foo=b.Foo
-- no where clause
When searching for data from anything other than the prior day, the query analyzer does what would be expected and uses a hash match join to complete the outer join, reading about 116 pages worth of data from TABLE B. If run for the prior day, however, the query optimizer freaks out, uses a nested loops join instead, scans the table 7,000+ times, and reads 8,000,000+ pages in the join.
We can fake it/force it to use a different query plan by using join hints (example below); however, that causes any constraints in the view that look at TABLE B to make the optimizer throw an error that the query can't be completed due to the join hints.
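For reference, the kind of hint involved looks like this in T-SQL (a sketch against the view; this is the sort of statement that then trips over the view's constraints):
SELECT a.Col_1, b.Col_2
FROM TABLE_A a
LEFT OUTER HASH JOIN View_B b  -- HASH forces a hash match for this join
  ON a.ExtractDate = b.ExtractDate AND a.Foo = b.Foo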
Editing to add that the pages/scans figure equals the number hit in a single scan when the query is run for a day where the optimizer correctly chooses a hash join instead of a nested loop.
As mentioned in the comments, we have severely reduced the impact by creating a covering index on TABLE_B to cover the join in View_B, but the IO is still higher than it would be if the optimizer chose the correct plan, especially since the index is essentially redundant for all but prior-day searches.
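The index might have looked along these lines (a sketch only; the post doesn't give the actual DDL, and the key and included columns are guesses from the view definition above):
CREATE NONCLUSTERED INDEX IX_TableB_ExtractDate
ON TABLE_B (ExtractDate, Foo)
INCLUDE (cItem, dItem, eItem);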
The sqlplan is at http://pastebin.com/m53789da9, sorry that it's not the nicely formatted version.
If you can post the .sqlplan for each of the queries it would help for sure, but my hunch is that you are getting a parallel plan when querying for dates prior to the current day, and that the nested loop is a constant loop over the partitions included in the table, which would then spawn a worker thread for each partition (for more information on this, see the SQLCAT post on parallel plans with partitioned tables in SQL 2005). Can't verify whether this is the case without seeing the plans, however.
In case anyone ever runs into this: the issue appears to be only tangentially related to the partitioning scheme. Even though we run a statistics update nightly, it appears that SQL Server
Didn't create a statistic on ExtractDate
Even when the ExtractDate statistic was explicitly created, didn't pick up that the prior day had data
We resolved it by explicitly creating the ExtractDate statistic with a full scan, as shown below. Now searching for the prior day and a random sampling of days appears to generate the correct plan.
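In runnable form (CREATE STATISTICS needs an explicit column list):
CREATE STATISTICS TABLE_A_ExtractDate_Stats
ON TABLE_A (ExtractDate)
WITH FULLSCAN;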