We have TABLE A, which is partitioned by date and does not contain data from today; it only contains data from the prior day back through year to date.
We have TABLE B, also partitioned by date, which does contain data from today as well as data from the prior day back through year to date. On top of TABLE B there is a view, View_B, which joins against View_C and View_D and left outer joins TABLE E. View_C and View_D each select from a single table and have no other tables joined in. So View_B looks something like:
SELECT b.Foo, c.cItem, d.dItem, E.eItem
FROM TABLE_B b JOIN View_C c on c.cItem = b.cItem
JOIN View_D d on b.dItem = d.dItem
LEFT OUTER JOIN TABLE_E e on b.eItem = e.eItem
View_AB joins TABLE A and View_B on extract date as well as one other constraint. So it looks something like:
SELECT a.Col_1, b.Col_2, ...
FROM TABLE_A a LEFT OUTER JOIN View_B b
on a.ExtractDate = b.ExtractDate and a.Foo=b.Foo
-- no where clause
When searching for data from any day other than the prior day, the query optimizer does what would be expected: it uses a hash match join to complete the outer join and reads about 116 pages of data from TABLE B. When run for the prior day, however, the optimizer freaks out and uses a nested loops join, scans the table 7000+ times, and reads 8,000,000+ pages in the join.
We can force it to use a different query plan with join hints; however, any constraints in the view that look at TABLE B then cause the optimizer to throw an error saying that the query can't be completed because of the join hints.
Editing to add that the number of pages read per scan is the same as the number read in the single scan when the query is run for a day where the optimizer correctly chooses a hash join instead of a nested loops join.
As mentioned in the comments, we have severely reduced the impact by creating a covering index on TABLE_B to cover the join in View_B, but the IO is still higher than it would be if the optimizer chose the correct plan, especially since the index is essentially redundant for all but prior-day searches.
The sqlplan is at http://pastebin.com/m53789da9, sorry that it's not the nicely formatted version.
If you can post the .sqlplan for each of the queries it would certainly help, but my hunch is that you are getting a parallel plan when querying for dates prior to the current day, and that the nested loop is possibly a constant loop over the partitions included in the table, which would then spawn a worker thread for each partition (for more information on this, see the SQLCAT post on parallel plans with partitioned tables in SQL Server 2005). I can't verify whether this is the case without seeing the plans, however.
In case anyone ever runs into this, the issue appears to be only tangentially related to the partitioning scheme. Even though we run a statistics update nightly, it appears that SQL Server:
didn't create a statistic on ExtractDate
even when the ExtractDate statistic was explicitly created, didn't pick up that the prior day had data
We resolved it by running CREATE STATISTICS TABLE_A_ExtractDate_Stats ON TABLE_A (ExtractDate) WITH FULLSCAN. Now searching for the prior day and for a random sampling of days appears to generate the correct plan.
For example:
In table a we have 1000000 rows
In table b we have 5 rows
Is it faster to use
select * from b inner join a on b.id = a.id
than
select * from a inner join b on a.id = b.id
No, JOIN order doesn't matter; the query engine will reorganize the joins based on index statistics and other factors. The join order is changed during optimization.
You can test it yourself: download a sample database like AdventureWorks or Northwind, or try it on your own database, and do this:
turn on the actual execution plan and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical as the query engine will reorganize them according to other factors.
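The comparison described above can be tried without a SQL Server instance. The following is only a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in engine and hypothetical miniature versions of tables a and b; SQLite's planner is not SQL Server's, so the printed plans illustrate the idea rather than reproduce what SQL Server would show.

```python
import sqlite3

# Hypothetical miniature versions of table a (large) and table b (small).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 10001)])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [(i, f"b{i}") for i in (1, 50, 500, 5000, 9999)])

# The same inner join written in both orders.
b_first = "SELECT a.id, a.payload, b.label FROM b INNER JOIN a ON b.id = a.id"
a_first = "SELECT a.id, a.payload, b.label FROM a INNER JOIN b ON a.id = b.id"

rows_b_first = sorted(conn.execute(b_first).fetchall())
rows_a_first = sorted(conn.execute(a_first).fetchall())


def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN is the readable step description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


print("same rows:", rows_b_first == rows_a_first)
print("b first:", plan(b_first))
print("a first:", plan(a_first))
```

Both queries return the same rows; the two printed plans show which table the planner chose to drive the join in each case.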
The only caveat is the OPTION (FORCE ORDER) query hint, which forces the joins to happen in exactly the order in which you specified them.
It is unlikely. There are lots of factors on the speed of joining two tables. That is why database engines have an optimization phase, where they consider different ways of implementing the query.
There are many different options:
Nested loops, scanning b first and then a.
Nested loops, scanning a first and then b.
Sorting both tables and using a merge join.
Hashing both tables and using a hash join.
Using an index on b.id.
Using an index on a.id.
And these are just high level descriptions -- there are multiple ways to implement some of these methods. Tables can also be partitioned adding further complexity.
Join order is just one consideration.
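Two of the strategies listed above, nested loops and a hash join, can be sketched in a few lines. This is an illustration in plain Python over hypothetical in-memory rows, not how any database engine actually implements them; the function names and data are made up for the example.

```python
def nested_loops_join(outer, inner):
    """Compare every outer row with every inner row on the first field."""
    return [(o, i) for o in outer for i in inner if o[0] == i[0]]


def hash_join(build, probe):
    """Hash the build side once, then look up each probe row by its key."""
    buckets = {}
    for row in build:
        buckets.setdefault(row[0], []).append(row)
    return [(b, p) for p in probe for b in buckets.get(p[0], [])]


# Hypothetical data: rows are (id, payload) tuples.
table_a = [(i, f"a{i}") for i in range(1000)]
table_b = [(i, f"b{i}") for i in (3, 42, 500, 999, 1500)]

nested = sorted(nested_loops_join(table_b, table_a))
hashed = sorted(hash_join(table_b, table_a))
print(nested == hashed, len(nested))
```

Both functions produce the same matches; they differ only in how much work they do to find them, which is the kind of trade-off the optimizer weighs.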
In this case, the running time of the query is likely to depend on the size of the data being returned, rather than on the actual algorithm used for fetching the data.
I have a relatively simple query
SELECT
db1.something
, COALESCE(db2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
EXPLAIN gives an estimated time of something more than 15 seconds.
On the other hand, explain on the following, where we basically replaced the alias with the table name:
SELECT
db1.something
, COALESCE(db_2.something_else, 'NA') AS something2
FROM dwh.db_1 AS db1
LEFT JOIN dwh.db_2 AS db2 ON db1.some_id = db2.some_id
gives an estimated time of over 4 hours, where it seems like the system is trying to execute a product join on some spool (I can't really follow the sequence of planning steps).
I always thought that aliases are just aliases and have no impact on perf.
The estimated time is probably correct :-)
A table alias is not really an alias; it replaces the table name within that query. In Teradata, using the original table name doesn't result in an error message (as it does in most other DBMSes), but it causes a CROSS join.
Why? Well, Teradata was implemented before there was Standard SQL. The initial query language was called TEQUEL (TEradata QUEry Language), whose syntax didn't require listing tables in a FROM clause. A simple RETRIEVE TableName.ColumnName carried enough information for the Parser/Optimizer to resolve the table name and column name. There's no flag to switch it off; some client tools refuse to submit it, but you can still submit RETRIEVE in BTEQ.
In the example above you're mixing old TEQUEL and SQL: there are 3 tables for the optimizer but only one join condition, which results in a CROSS join to the third table.
At least it's easy to spot in Explain. The optimizer will do this stupid join as last step, so scroll to the end and you will see joined using a product join, with a join condition of ("(1=1)").
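The effect on row counts can be imitated outside Teradata. The sketch below assumes SQLite (via Python's sqlite3) and hypothetical data; because SQLite would reject the unaliased reference, the third, unconstrained copy of db_2 is written out as an explicit CROSS JOIN to mimic what the answer above says the Teradata optimizer ends up doing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db_1 (some_id INTEGER, something TEXT)")
conn.execute("CREATE TABLE db_2 (some_id INTEGER, something_else TEXT)")
# Hypothetical data: 100 rows in db_1, 50 rows in db_2.
conn.executemany("INSERT INTO db_1 VALUES (?, ?)",
                 [(i, f"s{i}") for i in range(100)])
conn.executemany("INSERT INTO db_2 VALUES (?, ?)",
                 [(i, f"e{i}") for i in range(0, 100, 2)])

# The intended query: two table references, one join condition.
intended = """
    SELECT COUNT(*)
    FROM db_1 AS db1
    LEFT JOIN db_2 AS db2 ON db1.some_id = db2.some_id
"""

# Imitation of the accidental query: a third reference to db_2
# with no join condition at all.
accidental = """
    SELECT COUNT(*)
    FROM db_1 AS db1
    LEFT JOIN db_2 AS db2 ON db1.some_id = db2.some_id
    CROSS JOIN db_2
"""

intended_rows = conn.execute(intended).fetchone()[0]
accidental_rows = conn.execute(accidental).fetchone()[0]
print(intended_rows, accidental_rows)
```

Every row of the intended result is repeated once per row of db_2, which is why the estimate balloons on large tables.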
I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading off storage space and execution time, ideally, to achieve something running in under a minute?
I am not a database expert and have been reading the Oracle performance tuning docs, but I was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table was fixed and I simply needed to apply different filters on the data.
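For anyone who wants to experiment with the shape of this query on a small scale, here is a minimal sketch assuming SQLite (via Python's sqlite3) in place of Oracle, with COALESCE standing in for NVL and a handful of hypothetical rows. It only shows what the query computes; it says nothing about Oracle's performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (date TEXT, id INTEGER, value REAL)")
conn.execute(
    "CREATE TABLE table2 (id INTEGER, start_date TEXT, end_date TEXT, factor REAL)"
)
# Hypothetical sample rows.
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ("2020-01-01", 1, 10.0),
    ("2020-01-01", 2, 20.0),
    ("2020-01-02", 1, 10.0),
    ("2020-01-02", 2, 20.0),
])
# id 1 gets factor 2.0 from 2020-01-02 onwards; id 2 has no factor at all.
conn.execute("INSERT INTO table2 VALUES (1, '2020-01-02', '2020-01-31', 2.0)")

totals = conn.execute("""
    SELECT table1.date,
           SUM(table1.value * COALESCE(table2.factor, 1)) AS total
    FROM table1
    LEFT JOIN table2
      ON table1.id = table2.id
     AND table1.date BETWEEN table2.start_date AND table2.end_date
    GROUP BY table1.date
    ORDER BY table1.date
""").fetchall()
print(totals)
```

Rows of table1 with no matching range in table2 keep a factor of 1, which is the job NVL does in the original query.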
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
Apart from indexes, also try the following. My two cents!
Try running the query in parallel across multiple processors, using a hint such as /*+ PARALLEL(table1,4) */.
NVL is evaluated for millions of rows, which will have some impact. Is there any way the data can be organised to avoid it?
If you know the dates in advance, you could split the query into two steps: first fetch the ids from TABLE2 using the start date and end date, then join that result to TABLE1 using a view or temp table. That way the index (with id as its leading column) is used optimally.
Thanks!
For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't change the results, but does it affect performance?
One can assume that the data in table C is much larger than in a or b.
For each SQL statement, the engine will create a query plan. So no matter how you order them, the engine will choose a suitable path to execute the query.
More on query plans: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the join order and plan using hints, if you feel that the engine does not choose the correct path.
Sometimes the order of the tables does make a difference (when you are using different join types).
Logically, joins work on the cross product concept.
If you write a query like A join B join C,
it is treated as (A*B)*C.
That means the first result comes from joining tables A and B, and that result is then joined with table C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
those 100 records are then compared with the 1000 records of C.
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.
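To get a feel for how quickly the number of join orders grows, here is a small sketch. It assumes the simplest possible model, counting only the orderings of n tables in a left-deep plan (n factorial); real optimizers also consider other plan shapes and prune heavily, so treat the figures as an illustration, not as what any particular optimizer enumerates.

```python
from math import factorial


def left_deep_join_orders(table_count: int) -> int:
    """Number of ways to order table_count tables in a left-deep join."""
    return factorial(table_count)


for n in (3, 5, 10, 15):
    print(f"{n} tables: {left_deep_join_orders(n):,} possible orders")
```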
There is a simple SQL JOIN statement below:
SELECT
REC.[BarCode]
,REC.[PASSEDPROCESS]
,REC.[PASSEDNODE]
,REC.[ENABLE]
,REC.[ScanTime]
,REC.[ID]
,REC.[Se_Scanner]
,REC.[UserCode]
,REC.[aufnr]
,REC.[dispatcher]
,REC.[matnr]
,REC.[unitcount]
,REC.[maktx]
,REC.[color]
,REC.[machinecode]
,P.PR_NAME
,N.NO_NAME
,I.[inventoryID]
,I.[status]
FROM tbBCScanRec as REC
left join TB_R_INVENTORY_BARCODE as R
ON REC.[BarCode] = R.[barcode]
AND REC.[PASSEDPROCESS] = R.[process]
AND REC.[PASSEDNODE] = R.[node]
left join TB_INVENTORY as I
ON R.[inventid] = I.[id]
INNER JOIN TB_NODE as N
ON N.NO_ID = REC.PASSEDNODE
INNER JOIN TB_PROCESS as P
ON P.PR_CODE = REC.PASSEDPROCESS
The table tbBCScanRec has 556553 records, the table TB_R_INVENTORY_BARCODE has 260513 records, and the table TB_INVENTORY has 7688. However, the last two tables (TB_NODE and TB_PROCESS) both have fewer than 30 records.
Incredibly, when it runs in SQL Server 2005, it takes 8 hours to return the result set.
Why does it take so much time to execute?
If the two inner joins are removed, it takes just ten seconds to finish running.
What is the matter?
There are at least two UNIQUE NONCLUSTERED INDEXes.
One is IX_INVENTORY_BARCODE_PROCESS_NODE on the table TB_R_INVENTORY_BARCODE, which covers four columns (inventid, barcode, process, and node).
The other is IX_BARCODE_PROCESS_NODE on the table tbBCScanRec, which covers three columns (BarCode, PASSEDPROCESS, and PASSEDNODE).
Well, standard answer to questions like this:
Make sure you have all the necessary indexes in place, i.e. indexes on N.NO_ID, REC.PASSEDNODE, P.PR_CODE, REC.PASSEDPROCESS
Make sure that the types of the columns you join on are the same, so that no implicit conversion is necessary.
You are working with around 500 million rows (556553 * 30 * 30).
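The figure above is straightforward to check. The sketch below assumes the worst case in which both small tables contribute a full 30 rows per scan record (the question only says "fewer than 30", so this is an upper bound); the variable names are made up for the example.

```python
# Row counts taken from the question; 30 is an upper bound for the small tables.
scan_rows = 556_553   # tbBCScanRec
node_rows = 30        # TB_NODE (fewer than 30 in practice)
process_rows = 30     # TB_PROCESS (fewer than 30 in practice)

worst_case = scan_rows * node_rows * process_rows
print(f"{worst_case:,} row combinations in the worst case")
```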
You probably have to add indexes on your tables.
If you are using SQL Server, you can look at the query plan to see where you are losing time.
See the documentation here : http://msdn.microsoft.com/en-us/library/ms190623(v=sql.90).aspx
The query plan will help you to create indexes.
When you check the indexing, look for clustered indexes as well: nonclustered indexes point to rows through the clustering key when one exists and through row identifiers on a heap, so a missing clustered index can make lookups more expensive. Outdated statistics could also be a problem.
However, why do you need to fetch ALL of the data? What is the purpose of that? You should have WHERE clauses restricting the result set to only what you need.