I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query selects the top N rows of the table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
SELECT
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
AND lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans could be selected when you limit the number of rows returned. It can be easier to select the top n rows by some order key, then sort those, than to sort the entire table then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
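If you want to check whether you're getting the top-N optimization, look for a SORT ORDER BY STOPKEY step in the plan; that operation keeps only the top N rows in the sort workarea instead of sorting the full set. Note too that, as far as I can tell, the FIRST_ROWS(n) hint needs a literal n, so FIRST_ROWS(p_how_many) written with a parameter name won't be parsed as you intend. A sketch of the check, using a trimmed version of your query with a literal in place of p_how_many:
EXPLAIN PLAN FOR
SELECT log_number FROM (
    SELECT il2.log_number, il2.final_date
    FROM log il2
    WHERE il2.current_user IS NULL
    GROUP BY il2.log_number, il2.final_date
    ORDER BY il2.final_date ASC)
WHERE ROWNUM <= 10;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- "COUNT STOPKEY" over "SORT ORDER BY STOPKEY" means Oracle is doing a top-N
-- sort; a plain "SORT ORDER BY" means the whole result set is sorted before
-- the stopkey is applied.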
I'm trying to develop a query which involves three tables. TBL3 has 500k records, TBL6 has 10 million records, and TBL2, TBL4, and TBL5 have 20 million records and are the same table. TBL1 is a subquery with a number of joins. I cannot change the database structure, but I have checked the available indexes and that hasn't helped. I used the OUTER APPLY join at the end as I thought that might speed up my performance, but after much experimenting I still end up killing the query after 15-20 minutes.
SELECT TBL1.START_TIME,
TBL1.DEST_TIME,
TBL1.SRCADDRESS,
TBL1.POS,
TBL2.ADDRESSID AS TBL2 FROM (
SELECT TBL3.EVENTTIME,
TBL3.SOURCEADDRESS,
TBL6.FROM_POS,
TBL3.LC_NAME
FROM CUSTOMER_OVERVIEW_V TBL3
INNER JOIN CUSTOMER_SALE_RELATED TBL6 ON TBL6.LC_NAME = TBL3.LC_NAME
AND TBL6.FROM_LOC = TBL3.SOURCEADDRESS
INNER JOIN CUSTOMER TBL4 ON TBL4.CUSTID = TBL3.LC_NAME
AND TBL4.AREATYPE = 'SW'
AND TBL4.EVENTTIME <= TBL3.EVENTTIME + interval '1' second
AND TBL4.EVENTTIME >= TBL3.EVENTTIME - interval '1' second
INNER JOIN CUSTOMER TBL5 ON TBL5.CUSTID = TBL3.LC_NAME
AND TBL5.AREATYPE = 'SE'
AND TBL5.EVENTTIME <= TBL3.EVENTTIME + interval '1' second
AND TBL5.EVENTTIME >= TBL3.EVENTTIME - interval '1' second
WHERE TBL3.SOURCEADDRESS IS NOT NULL
AND extract(second from TBL5.EVENTTIME - TBL4.EventTime) * 1000 > 250
ORDER BY TBL3.EVENTTIME DESC
FETCH FIRST 500 ROWS ONLY) TBL1
OUTER APPLY (SELECT ADDRESSID
FROM CUSTOMER
WHERE AREATYPE = 'STH'
AND EVENTTIME > TBL1.DEST_TIME
ORDER BY EVENTTIME ASC
FETCH FIRST 1 ROW ONLY) TBL2;
There must be a way to structure this query better to improve the performance so any suggestions would be appreciated.
You are asking for the first 500 rows, so you only want a tiny fraction of your overall data. Therefore you want to use nested loops joins with appropriate indexes in order to get those 500 rows and be done, rather than have to process millions of rows and only then take off the top 500.
That however is complicated by the fact that you want the first 500 after ordering results. That ORDER BY will require a sort, and a sort operation will have to process 100% of the incoming rows in order to produce even its first row of output, which means you have to crunch 100% of your data and that kills your performance.
This only works if several things hold: your inner joins to TBL6, TBL4 and TBL5 must not be meant to drop rows that lack a matching key (i.e., making them outer joins would leave the same number of result rows from TBL3); you must not really need the filter extract(second from TBL5.EVENTTIME - TBL4.EventTime) * 1000 > 250, which requires joining TBL4 and TBL5 before rows can be limited; and CUSTOMER_OVERVIEW_V must either be a simple single-table view that applies no predicates of its own (not likely), or you must be able to bypass the view and hit the table directly. If all of that is true, then you can do this:
Create a two-column function-based index (e.g. customer_eventtime_idx) on (DECODE(sourceaddress,NULL,'Y','N'), eventtime) of the base table underlying the customer_overview_v view, in that exact column order.
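For example (TABLE_UNDER_CUSTOMER_OVERVIEW_V is the same placeholder used in the query below; substitute the real base table name):
-- Two-column function-based index, DECODE result first, eventtime second.
CREATE INDEX customer_eventtime_idx
    ON table_under_customer_overview_v
       (DECODE(sourceaddress, NULL, 'Y', 'N'), eventtime);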
Rewrite the query to get the 500 rows as early as possible, preferably before any joins, using a hint to force a descending index scan on this index. You will also need to change your IS NOT NULL predicate to the same DECODE function used in the index definition:
SELECT /*+ LEADING(c) USE_NL(csr c1 c2 adr) */
[columns needed]
FROM (SELECT /*+ NO_MERGE INDEX_DESC(x customer_eventtime_idx) */
[columns you need]
FROM TABLE_UNDER_CUSTOMER_OVERVIEW_V x
WHERE DECODE(SOURCEADDRESS,NULL,'Y','N') = 'N'
ORDER BY DECODE(SOURCEADDRESS,NULL,'Y','N') DESC,EVENTTIME DESC
FETCH FIRST 500 ROWS ONLY) c
INNER JOIN CUSTOMER_SALE_RELATED ... csr
INNER JOIN CUSTOMER ... c1
INNER JOIN CUSTOMER ... c2
OUTER APPLY ([address subquery like you have it]) adr
WHERE extract(second from c.EVENTTIME - c1.EventTime) * 1000 > 250
Generate your explain plan and make sure you see a descending index scan on the index you created (it must say DESCENDING), and also ensure you see an ORDER BY NOSORT step afterwards (it must say NOSORT). Creating the index as we have and ordering the query as we have was all about getting these plan operations selected. This is not easy to get right; every detail of the inner c query must be crafted precisely to achieve the trick.
Explanation:
Oracle should seek into the index on the DECODE Y/N result: find the 'N' entries (those with a non-null sourceaddress) in the first column of the index starting at the leading edge of the 'N' values, get the corresponding row from the table, then step backward to the previous 'N' entry, get that table row, then the previous one, and so on, emitting rows as they are found. Because the ORDER BY matches the index exactly and in the same direction (descending), Oracle skips the SORT operation, as it knows the data coming out of the index step is already in the correct order.
These rows therefore stream out of the c inline view as they are found, which allows the FETCH FIRST to stop processing when it gets to 500 rows of output without waiting for any buffering operation (like a SORT) to complete. You only ever touch those 500 records; it never visits the rest of the rows.
You then join via nested loops to the remaining tables, but you only have 500 driving rows. Obviously you must ensure appropriate indexing on those tables for your join clauses.
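For example, something along these lines; the column choices are inferred from the join and APPLY clauses above, so treat them as a starting point and verify against your actual predicates:
-- Supports the join from c to CUSTOMER_SALE_RELATED:
CREATE INDEX csr_name_loc_ix ON customer_sale_related (lc_name, from_loc);

-- Supports the two CUSTOMER joins: equality on custid and areatype, range on eventtime:
CREATE INDEX cust_id_area_time_ix ON customer (custid, areatype, eventtime);

-- Supports the OUTER APPLY lookup: equality on areatype, range scan on eventtime,
-- with addressid included so the lookup can be answered from the index alone:
CREATE INDEX cust_area_time_ix ON customer (areatype, eventtime, addressid);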
If your CUSTOMER_OVERVIEW_V view however does joins and applies more predicates, you simply cannot do this with the view. You will have to use the base tables: apply this trick on whatever base table has the eventtime column, then join in whatever you need that the view was joining to, and reproduce any remaining view logic you need (though you might find it does more than you need, and you can omit much of it). In general, don't use views when you can help it. You always have more control against base tables.
Lastly, note that I did not follow your TBL1, TBL2, TBL3, etc. alias convention. That is hard to read because you constantly have to look elsewhere to see what "TBL3" means. Far better to use aliases that communicate immediately which table they are, such as the initial letter, a couple of letters, or an acronym formed from the first letter of each word.
I have a table of 'games' where some of those games (but not all) are grouped into 'campaigns' by a campaign ID.
I am trying to write an SQL query which will get a dataset containing various information about the games, but in particular: if a given game is part of a campaign how many games are in that campaign in total (I have this working) and the index of that game within the campaign (e.g. the earliest game in a campaign is index '1', the next is '2' and so on).
I have achieved this, but the execution plan looks terrible and the obvious way to fix that doesn't work, but I get ahead of myself.
Here is the working query, with some extraneous stuff removed:
SELECT
g1.`id` AS `game_id`,
(SELECT
COUNT(*)
FROM `games` g3
WHERE g3.`campaign` = g1.`campaign`
) AS `campaign_length`,
ca2.`ri` AS `campaign_index`,
ca1.`id` AS `campaign_id`, ca1.`name` AS `campaign_name`
FROM `games` g1
LEFT JOIN `campaigns` ca1 ON ca1.`id` = g1.`campaign`
LEFT JOIN (
SELECT
g4.`id` AS `id`,
ROW_NUMBER() OVER (
PARTITION BY g4.`campaign`
ORDER BY g4.`start` ASC) AS `ri`
FROM `games` g4
) AS ca2 ON ca2.`id` = g1.`id`
WHERE g1.`end` > CURRENT_TIMESTAMP()
AND g1.`gamemaster` = 25
ORDER BY g1.`start` ASC
;
The problem with this version is that for table g4 the execution plan lists a full table scan - which is fine at the moment as there are only a few hundred records, but long term it will be terrible for performance, especially as this query (or ones very similar to it) will be executed on many different pages of my website. I believe this is happening because the ROW_NUMBER() function needs to number all the rows before the LEFT JOIN's ON clause can filter them down to the ones I actually need.
The obvious solution, which I have tried to no avail, is to add
WHERE g4.`campaign` = g1.`campaign` after FROM `games` g4;
that way ROW_NUMBER() would only need to number those records that have a chance of being returned in the dataset. However this does not work because g1.`campaign` is not in scope.
I can do WHERE g4.`campaign` IS NOT NULL, which at least gets the execution plan down to an index condition instead of a full table scan, but it will still not scale nicely as the number of games in campaigns grows over time.
I know that my "obvious solution" won't work because of the scope problem, but does anyone have a suggestion for how I can achieve what I'm trying to do without a terrible execution plan?
Based on your comments, the campaign_index must be calculated before the WHERE clause is applied. This means that calculation of the campaign_index will always require a full table scan, as the WHERE clause can't reduce the rows being computed over.
You can, however, use windowed functions rather than a self join and correlated sub-query...
WITH
games AS
(
SELECT
*,
COUNT(*)
OVER (
PARTITION BY `campaign`
)
AS `campaign_length`,
ROW_NUMBER()
OVER (
PARTITION BY `campaign`
ORDER BY `start`
)
AS `campaign_index`
FROM
games
)
SELECT
games.*,
campaigns.`name` AS `campaign_name`
FROM
games
LEFT JOIN
campaigns
ON campaigns.`id` = games.`campaign`
WHERE
games.`end` > CURRENT_TIMESTAMP()
AND games.`gamemaster` = 25
ORDER BY
games.`start`
;
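As an aside: if you are on MySQL 8.0.14 or later, a lateral derived table lets you express something very close to your "obvious solution", because a LATERAL subquery may reference columns of tables that precede it. A sketch, assuming `start` values are unique within a campaign and an index on (`campaign`, `start`):
SELECT
g1.`id` AS `game_id`,
ca2.`campaign_length`,
ca2.`campaign_index`,
ca1.`id` AS `campaign_id`, ca1.`name` AS `campaign_name`
FROM `games` g1
LEFT JOIN `campaigns` ca1 ON ca1.`id` = g1.`campaign`
LEFT JOIN LATERAL (
    -- Correlated on g1, so only rows from g1's own campaign are visited.
    SELECT COUNT(*) AS `campaign_length`,
           SUM(g4.`start` <= g1.`start`) AS `campaign_index`
    FROM `games` g4
    WHERE g4.`campaign` = g1.`campaign`
) AS ca2 ON TRUE
WHERE g1.`end` > CURRENT_TIMESTAMP()
AND g1.`gamemaster` = 25
ORDER BY g1.`start`;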
Copy the table into a new table with a fresh AUTO_INCREMENT id. This will quickly add row numbers.
CREATE TABLE new_list (
row_num INT AUTO_INCREMENT NOT NULL,
INDEX(row_num) ) ENGINE=InnoDB
SELECT ... FROM ...
ORDER BY ... -- this will do the sorting before numbering
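Filled in with hypothetical columns from the games table above, that would look like:
CREATE TABLE games_numbered (
    row_num INT AUTO_INCREMENT NOT NULL,
    INDEX(row_num) ) ENGINE=InnoDB
SELECT `id`, `campaign`, `start`
FROM `games`
ORDER BY `campaign`, `start`;  -- rows receive row_num in this order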
I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would, however, prefer something akin to the following, but referencing the outer table from two SELECT levels deep doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
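One caveat: the INNER JOIN drops projects that have no locations at all, while your original NVL fell back to PROJECT_NUMBER for those. If you need that behaviour, an outer join preserves it, since MIN(...) KEEP over zero matching rows yields NULL and the NVL kicks in. This sketch assumes (ITEM_NUMBER, PROJECT_NUMBER) is unique in PROJECT:
SELECT
    P.ITEM_NUMBER,
    P.PROJECT_NUMBER,
    NVL(MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC),
        P.PROJECT_NUMBER) LOCATION
FROM
    PROJECT P
LEFT JOIN
    LOCATIONS L
    ON L.ITEM_NUMBER = P.ITEM_NUMBER
    AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
GROUP BY
    P.ITEM_NUMBER,
    P.PROJECT_NUMBER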
I encountered a similar problem in the past, and while this is not the ultimate solution (in fact it might just be cutting corners), the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init parameter.
Have a look at chapter 11.2.1 of http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
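In practice, switching the mode looks like this (illustrative only; whether it helps depends entirely on your data and plan):
-- Bias the session's costing toward fast delivery of the first 10 rows,
-- using the FIRST_ROWS_n form the documentation recommends:
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_10;

-- Verify the current setting (requires SELECT access on v$parameter):
SELECT value FROM v$parameter WHERE name = 'optimizer_mode';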
Of course, there are tons of other factors you should analyse, like your indexes, join efficiency, query plan, and so on.
I have the following query which takes too long to retrieve around 70000 records. I noticed that the execution time is proportional to the number of records retrieved. I need to optimize this query so that the execution time is not proportional to the number of records retrieved. Any ideas?
;WITH TT AS (
SELECT TaskParts.[TaskPartID],
PartCost,
LabourCost,
VendorPaidPartAmount,
VendorPaidLabourAmount,
ROW_NUMBER() OVER (ORDER BY [Employees].[EmpCode] asc) AS RowNum
FROM [TaskParts], [Tasks], [WorkOrders], [Employees], [Status], [Models], [SubAccounts]
WHERE 1=1
AND (TaskParts.TaskLineID = Tasks.TaskLineID)
AND (Tasks.WorkOrderID = [WorkOrders].WorkOrderID)
AND (Tasks.EmpID = [Employees].EmpID)
AND (TaskParts.StatusID = [Status].StatusID)
AND (Models.ModelID = Tasks.FailedModelID)
AND (SubAccounts.SubAccountID = Tasks.SubAccountID)
AND (SubAccounts.GLAccountID = 5))
SELECT COUNT(0),
SUM(ISNULL(PartCost,0)),
SUM(ISNULL(LabourCost,0)),
SUM(ISNULL(VendorPaidPartAmount,0)),
SUM(ISNULL(VendorPaidLabourAmount,0))
FROM TT
As Lieven noted, you can remove TD0, TD1 and TP1 as they are redundant.
You can also remove the row_number column, as that is not used and windowing functions are relatively expensive.
It may also be possible to remove some of the tables from the TT CTE if they are not used; however, as table names have not been included with each column selected, it isn't possible to tell which tables are not being used.
Aside from that, your query's response will always be proportional to the number of rows returned, because the RDBMS has to read each row returned to calculate the results.
Make sure that you have a supporting index for each foreign key. Also, it is most probably not the issue in this case, but SQL Server's optimizer works better with explicit inner joins.
Also, I don't see any reason why you need RowNum if you only need totals.
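A sketch of the CTE rewritten with explicit inner joins and without the unused ROW_NUMBER. Which table owns PartCost and the other money columns is an assumption (the original doesn't qualify them), so adjust the tp. prefixes as needed:
;WITH TT AS (
    SELECT tp.TaskPartID,
           tp.PartCost,               -- assumed to live on TaskParts
           tp.LabourCost,
           tp.VendorPaidPartAmount,
           tp.VendorPaidLabourAmount
    FROM TaskParts tp
    INNER JOIN Tasks t        ON tp.TaskLineID   = t.TaskLineID
    INNER JOIN WorkOrders wo  ON t.WorkOrderID   = wo.WorkOrderID
    INNER JOIN Employees e    ON t.EmpID         = e.EmpID
    INNER JOIN [Status] s     ON tp.StatusID     = s.StatusID
    INNER JOIN Models m       ON t.FailedModelID = m.ModelID
    INNER JOIN SubAccounts sa ON t.SubAccountID  = sa.SubAccountID
    WHERE sa.GLAccountID = 5
)
SELECT COUNT(0),
       SUM(ISNULL(PartCost, 0)),
       SUM(ISNULL(LabourCost, 0)),
       SUM(ISNULL(VendorPaidPartAmount, 0)),
       SUM(ISNULL(VendorPaidLabourAmount, 0))
FROM TT;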
I have a query which is not using my indexes. Can someone say why?
explain plan set statement_id = 'bad8' for
select
g1.g1id,a.a1id from atable a,
(
select
phone,address,g1id from gtable g
where
g.active = 0 and
(g.name is not null) AND
(SYSDATE - g.CTIME <= 2*365)
) g1
where
(
(a.phone.ph1 = g1.phone.ph1 and
a.phone.ph2 = g1.phone.ph2 and
a.phone.ph3 = g1.phone.ph3
)
OR
(a.address.ad1 = g1.address.ad1 and a.address.ad2 = g1.address.ad2)
)
In both tables, atable and gtable, I have these indexes:
1. On phone.ph1, phone.ph2, phone.ph3
2. On address.ad1, address.ad2
phone and address are custom (object) data types.
Using Oracle 11g.
Here is the explain plan query and output :
SELECT cardinality "Rows",
lpad(' ',level-1)||operation||' '||
options||' '||object_name "Plan"
FROM PLAN_TABLE
CONNECT BY prior id = parent_id
AND prior statement_id = statement_id
START WITH id = 0
AND statement_id = 'bad8'
ORDER BY id;
Result:
Rows        Plan
490191190   SELECT STATEMENT
null          CONCATENATION
490190502       HASH JOIN
511841            TABLE ACCESS FULL gtable
41332965          PARTITION LIST ALL
41332965            TABLE ACCESS FULL atable
688             HASH JOIN
376893            TABLE ACCESS FULL gtable
41332965          PARTITION LIST ALL
41332965            TABLE ACCESS FULL atable
Both atable,gtable have more than 10 million rows each.
Most values in columns phone and address don't repeat.
Which indices Oracle chooses depends on many factors, including things you haven't mentioned in your question, such as the number of rows in the table, the frequency of values within a column, and whether you have separate or combined indices when more than one column is indexed.
Having said that, I suppose the main reasons your indices aren't used are:
You don't join directly with GTABLE. Instead you join with a view that adds three WHERE clauses that aren't part of the index, which makes the index less effective in this constellation.
The JOIN condition includes an OR, which makes it difficult to use indices.
Update:
If Oracle used your indices to do the join - which is already very difficult due to the OR condition - it would end up with a huge number of ROWIDs. For each ROWID, it would then have to fetch the full row. Since a full table scan can easily be up to 50 times faster than fetching by ROWID (I don't know what value Oracle actually uses), it will only use the indices if it reliably knows that the join will reduce the number of rows to fetch by a factor of 50.
In your case, there are the remaining WHERE conditions (g.active = 0, g.name is not null, SYSDATE - g.CTIME <= 2*365), which aren't represented in the indices. So they have to be applied after the join and after the GTABLE rows have been fetched. This makes it even harder to reach a result set 50 times smaller than a full table scan.
So I'm pretty sure the Oracle cost estimate is correct, i.e. using the indices would result in a more expensive query and even longer execution time.
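If you do want to encourage index use, the usual workaround for an OR-ed join condition is to split it into two index-friendly joins and combine them (UNION removes pairs that match on both phone and address). Whether this beats the hash joins depends on selectivity - the CONCATENATION in your plan shows Oracle already performs OR-expansion and still prefers full scans - but it's the standard shape to try:
SELECT g1.g1id, a.a1id
FROM atable a
JOIN gtable g1
  ON a.phone.ph1 = g1.phone.ph1
 AND a.phone.ph2 = g1.phone.ph2
 AND a.phone.ph3 = g1.phone.ph3
WHERE g1.active = 0
  AND g1.name IS NOT NULL
  AND SYSDATE - g1.ctime <= 2*365
UNION
SELECT g1.g1id, a.a1id
FROM atable a
JOIN gtable g1
  ON a.address.ad1 = g1.address.ad1
 AND a.address.ad2 = g1.address.ad2
WHERE g1.active = 0
  AND g1.name IS NOT NULL
  AND SYSDATE - g1.ctime <= 2*365;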
We can say "your query does not use your indexes because does not need them". A hash join is better. To use your indexes, oracle need to full scan them(4 indexes), make two joins, make a rowid or, and after that read from tables probably many blocks. If he belives that the result has many rows, the CBO coose the full scans, because is faster.
There are no conditions that reduce the number of rows taken from tables. There is no range scan. It must do full scans.