Oracle performance issue in getting first row in sub query - sql

I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would however prefer something akin to the following but referencing within 2 select statements doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.

It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
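To make the intended result concrete, here is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in, since KEEP (DENSE_RANK FIRST ...) is Oracle-specific. The sample rows are invented, and the correlated subquery with ORDER BY ... LIMIT 1 only illustrates "first location per project, falling back to the project number"; it is not a tuned Oracle query. It also shows one difference that may matter: an inner join drops projects that have no location rows, while the NVL version in the question keeps them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project   (item_number INTEGER, project_number TEXT);
    CREATE TABLE locations (item_number INTEGER, project_number TEXT,
                            location TEXT, sort1 INTEGER, sort2 INTEGER);
    INSERT INTO project VALUES (1, 'P1'), (2, 'P2');
    -- Invented data: item 1 has three candidate locations, item 2 has none.
    INSERT INTO locations VALUES
        (1, 'P1', 'SHELF-B', 1, 5),
        (1, 'P1', 'SHELF-A', 1, 9),
        (1, 'P1', 'SHELF-C', 2, 7);
""")

# First location per project (sort1 ascending, sort2 descending),
# falling back to the project number when no location exists.
first_location = conn.execute("""
    SELECT p.item_number,
           p.project_number,
           COALESCE((SELECT l.location
                       FROM locations l
                      WHERE l.item_number = p.item_number
                        AND l.project_number = p.project_number
                      ORDER BY l.sort1, l.sort2 DESC
                      LIMIT 1),
                    p.project_number) AS location
      FROM project p
     ORDER BY p.item_number
""").fetchall()

# Projects that survive an inner join to locations.
joined_projects = conn.execute("""
    SELECT DISTINCT p.item_number, p.project_number
      FROM project p
      JOIN locations l
        ON l.item_number = p.item_number
       AND l.project_number = p.project_number
""").fetchall()

print(first_location)
print(joined_projects)
```

If projects without locations must still appear, a LEFT JOIN with NVL around the aggregate would presumably be needed in the Oracle version.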

I encountered a similar problem in the past, and while this is not the ultimate solution (in fact it might just be cutting corners), the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init parameter.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
Of course there are plenty of other factors you should analyse, such as your indexes, join efficiency, query plan, etc.

Related

Adding ORDER BY on SQLite takes huge amount of time

I've written the following query:
WITH m2 AS (
SELECT m.id, m.original_title, m.votes, l.name as lang
FROM movies m
JOIN movie_languages ml ON m.id = ml.movie_id
JOIN languages l ON l.id = ml.language_id
)
SELECT m.original_title
FROM movies m
WHERE NOT EXISTS (
SELECT 1
FROM m2
WHERE m.id = m2.id AND m2.lang <> 'English'
)
The results appear after 1.5 seconds.
After adding the following line at the end of the query, it takes at least 5 minutes to run it:
ORDER BY votes DESC;
It's not the size of the data, as ORDER BY on the entire table returns results in no time.
What am I doing wrong?
Why does the ORDER BY add so much time? (The query SELECT * FROM movies ORDER BY votes DESC returns immediately).
The order by in the CTE is irrelevant. But I would suggest aggregation for this purpose:
SELECT m.original_title
FROM movies m JOIN
movie_languages ml
ON m.id = ml.movie_id JOIN
languages l
ON l.id = ml.language_id
GROUP BY m.original_title, m.id
HAVING SUM(l.name <> 'English') = 0;
In order to examine your queries you may turn on the timer by entering .time on at the SQLite prompt. More importantly utilize the EXPLAIN function to see details on your query.
The query initially written does seem to be rather more complex than necessary as already pointed out above. It does not seem apparent what the necessity is for 'movie_languages' and 'languages' tables in general, but especially in this particular query. That would require more explanation on your part but I believe at least one could be removed thus speeding up your query.
The ORDER BY clause in SQLite is handled as described below.
SQLite attempts to use an index to satisfy the ORDER BY clause of a query when possible. When faced with the choice of using an index to satisfy WHERE clause constraints or satisfying an ORDER BY clause, SQLite does the same cost analysis described above and chooses the index that it believes will result in the fastest answer.
SQLite will also attempt to use indices to help satisfy GROUP BY clauses and the DISTINCT keyword. If the nested loops of the join can be arranged such that rows that are equivalent for the GROUP BY or for the DISTINCT are consecutive, then the GROUP BY or DISTINCT logic can determine if the current row is part of the same group or if the current row is distinct simply by comparing the current row to the previous row. This can be much faster than the alternative of comparing each row to all prior rows.
No index or column type is stated for votes, so SQLite follows the logic above and chooses 'the index that it believes will result in the fastest answer'. With an over-complicated query and no index on votes, which is the ORDER BY column, there is much more for it to figure out than necessary. Since the simple query with ORDER BY executes quickly, it is the complexity of the query that makes SQLite compute much more than necessary.
Additionally, the type of the column, most likely INTEGER, is important when sorting (and joining). Sorting on a character type would give wrong results once votes go above single digits, so it would be the wrong type to use (I'm not assuming you are doing this, just mentioning it).
So simplify the query, ensure your PRIMARY KEYS are properly set, and test it. If it is still not returning in time try an index on votes. This will give you much better insight into what is going on and how different changes affect your queries.
SQLite Documentation - check all and note 6. Sorting, Grouping and Compound SELECTs
SQLite Documentation - check 10. ORDER BY optimizations
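As a rough sketch of the "try an index on votes and look at the plan" advice, the snippet below uses Python's sqlite3 module with an invented three-row movies table. The index name idx_movies_votes is hypothetical, and the plan text SQLite prints varies between versions, so treat the printed plans as something to read rather than something to rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, original_title TEXT, votes INTEGER);
    INSERT INTO movies (original_title, votes) VALUES ('A', 10), ('B', 30), ('C', 20);
""")

QUERY = "SELECT original_title FROM movies ORDER BY votes DESC"

def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(QUERY)
# Hypothetical index name; the point is only to compare the two plans.
conn.execute("CREATE INDEX idx_movies_votes ON movies (votes)")
plan_after = plan(QUERY)

titles = [row[0] for row in conn.execute(QUERY)]

print("before index:", plan_before)
print("after index: ", plan_after)
print(titles)
```

Running the same plan() helper on the full query from the question, before and after each change, should show which step is responsible for the slowdown.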
You can do it with NOT EXISTS, without joins and aggregation (assuming that there is always at least 1 row for each movie in the table movie_languages):
SELECT m.*
FROM movies m
WHERE NOT EXISTS (
SELECT 1 FROM movie_languages ml
WHERE m.id = ml.movie_id
AND ml.language_id <> (SELECT l.id FROM languages l WHERE l.name = 'English')
)
ORDER BY m.votes DESC
or with a LEFT join to languages to get the unmatched rows:
SELECT m.*
FROM movies m
INNER JOIN movie_languages ml ON m.id = ml.movie_id
LEFT JOIN languages l ON l.id = ml.language_id AND l.name <> 'English'
WHERE l.id IS NULL
ORDER BY m.votes DESC
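A quick way to check whether two formulations agree is to run both on a small dataset that includes a movie with more than one language. The sketch below does that with Python's sqlite3 module; the rows are invented, and it assumes the languages table stores the language in a name column, as in the question's CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, original_title TEXT, votes INTEGER);
    CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movie_languages (movie_id INTEGER, language_id INTEGER);
    INSERT INTO movies VALUES (1, 'A', 10), (2, 'B', 20), (3, 'C', 30);
    INSERT INTO languages VALUES (1, 'English'), (2, 'French');
    -- A: English only, B: English and French, C: French only.
    INSERT INTO movie_languages VALUES (1, 1), (2, 1), (2, 2), (3, 2);
""")

not_exists_titles = [row[0] for row in conn.execute("""
    SELECT m.original_title
      FROM movies m
     WHERE NOT EXISTS (
           SELECT 1 FROM movie_languages ml
            WHERE m.id = ml.movie_id
              AND ml.language_id <> (SELECT l.id FROM languages l
                                      WHERE l.name = 'English'))
     ORDER BY m.votes DESC
""")]

left_join_titles = [row[0] for row in conn.execute("""
    SELECT m.original_title
      FROM movies m
     INNER JOIN movie_languages ml ON m.id = ml.movie_id
      LEFT JOIN languages l ON l.id = ml.language_id AND l.name <> 'English'
     WHERE l.id IS NULL
     ORDER BY m.votes DESC
""")]

print(not_exists_titles)
print(left_join_titles)
```

On this sample the two forms do not agree: the LEFT JOIN version also returns movie B, which has a French entry, so it seems worth checking both against your own data before relying on either.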
In a nutshell, when you include an ORDER BY clause, the database builds a list of the rows in the correct order and then returns the data in that order.
Building that list takes a lot of extra processing, which translates into a longer execution time.

SUM(…) OVER (ORDER BY …) in view causes poor performance

I've got a view defined that lists transactions together with a running total, something like
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
The biggest tables this query looks at have ~10 million rows, but on average when the view is queried it will only return a few tens of rows.
My issue is that when this SELECT statement is run directly for a given member, it executes extremely quickly and returns results in a couple of milliseconds. However, when queried as a view...
SELECT h.createdDate, h.value, h.runningTotal
FROM historyView h
WHERE h.username = 'blah#blah.com'
...the performance is dreadful. The two query plans are very different - in the first case it is pretty much ideal but in the latter case, there are loads of scans and hundreds of thousands/millions of rows being read. This is clearly because the filter on member is being run last thing after everything else has been done, rather than right up front at the start.
If I remove the SUM(x) OVER (ORDER BY y) clause, this problem goes away.
Is there something I can do to ensure that the SUM(x) OVER (ORDER BY y) clause does not ruin the query plan?
One solution to my problem is to let the query optimiser know it is safe to filter before running the windowed function, by partitioning on that property. The change to the view is:
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (PARTITION BY m.username ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
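Why the partition matters can be seen on a tiny dataset. Without PARTITION BY, the running total of one member includes other members' rows, so filtering first would change the result and the optimiser cannot move the filter ahead of the window function. The sketch below uses Python's sqlite3 module (it assumes a SQLite build with window functions, 3.25 or later) and invented rows in a cut-down allocations table; it is not the real view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE allocations (memberId INTEGER, createdDate TEXT, value INTEGER);
    INSERT INTO allocations VALUES
        (1, '2024-01-01', 10),
        (2, '2024-01-02', 100),
        (1, '2024-01-03', 5);
""")

def totals_for_member_1(window_clause):
    """Compute the running total over all rows, then filter to member 1."""
    sql = f"""
        SELECT createdDate, runningTotal
          FROM (SELECT memberId, createdDate,
                       SUM(value) OVER ({window_clause}) AS runningTotal
                  FROM allocations)
         WHERE memberId = 1
         ORDER BY createdDate
    """
    return conn.execute(sql).fetchall()

unpartitioned = totals_for_member_1("ORDER BY createdDate")
partitioned = totals_for_member_1("PARTITION BY memberId ORDER BY createdDate")

# Filtering first and then computing the running total.
filter_first = conn.execute("""
    SELECT createdDate, SUM(value) OVER (ORDER BY createdDate)
      FROM allocations
     WHERE memberId = 1
     ORDER BY createdDate
""").fetchall()

print(unpartitioned)
print(partitioned)
print(filter_first)
```

Only the partitioned form gives the same answer as filtering first, which is what makes the early filter a safe rewrite.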
Unfortunately this only creates the correct plan if filtering by the member's username is part of the query.
That's because there's probably an index on m.username. When it comes to query tuning it takes some trial and error.
When using window functions there is the concept of 'POC' index to take into consideration - just search on google (Itzik Ben-Gan has good references about this as well).
From the book 'High-Performance T-SQL Using Window Functions':
Absent a POC index, the plan includes a Sort iterator, and with large input sets, it can be quite
expensive. Sorting has N * LOG(N) complexity, which is worse than linear. This means that with more
rows, you pay more per row. For example 1000 * LOG(1000) = 3000 and 10000 * LOG(10000) = 40000.
This means that 10 times more rows results in 13 times more work, and it gets worse the further you go.
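The arithmetic in the quote can be reproduced in a few lines of Python. The figures only work out with a base-10 logarithm, so that is assumed here; sort_work is a made-up helper name, not something from the book.

```python
import math

def sort_work(n):
    """Rough N * LOG(N) cost estimate, using a base-10 logarithm."""
    return n * math.log10(n)

small = sort_work(1_000)    # about 3,000
large = sort_work(10_000)   # about 40,000
ratio = large / small       # about 13.3

print(round(small), round(large), round(ratio, 1))
```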
Here's a reference link to get started on window functions and indexes.

Oracle FIRST_ROWS optimizer hint

I'm writing a query against what is currently a small table in development. In production, we expect it to grow quite large over the life of the table (the primary key is a number(10)).
My query does a selection for the top N rows of my table, filtered by specific criteria and ordered by date ascending. Essentially, we're assigning records, in bulk, to a specific user for processing. In my case, N will only be 10, 20, or 30.
I'm currently selecting my primary keys inside a subselect, using rownum to limit my results, like so:
SELECT log_number FROM (
SELECT
il2.log_number,
il2.final_date
FROM log il2
INNER JOIN agent A ON A.agent_id = il2.agent_id
INNER JOIN activity lat ON il2.activity_id = lat.activity_id
WHERE (p_criteria1 IS NULL OR A.criteria1 = p_criteria1)
AND lat.criteria2 = p_criteria2
AND lat.criteria3 = p_criteria3
AND il2.criteria3 = p_criteria4
AND il2.current_user IS NULL
GROUP BY il2.log_number, il2.final_date
ORDER BY il2.final_date ASC)
WHERE ROWNUM <= p_how_many;
Although I have a stopkey due to the rownum, I'm wondering if using an Oracle hint here (/*+ FIRST_ROWS(p_how_many) */) on the inner select will affect the query plan in the future. I'd like to know more about what the database does when this hint is specified; does it actually make a difference if you have to order the table? (Seems like it wouldn't.) Or does it only affect the select portion, after the access and join parts?
Looking at the explain plan now doesn't get me much as the table hasn't grown yet.
Thanks for your help!
Even with an ORDER BY, different execution plans could be selected when you limit the number of rows returned. It can be easier to select the top n rows by some order key, then sort those, than to sort the entire table then select the top n rows.
However, the GROUP BY is likely to restrict the benefit of this sort of optimization. Grouping (or a DISTINCT operation) generally prevents the optimizer from using a plan that can pipe individual rows into a STOPKEY operation.
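The difference between "select the top n, then sort those" and "sort everything, then take the top n" can be sketched outside the database. The snippet below is only an analogy in plain Python with invented rows; it says nothing about how Oracle actually implements a STOPKEY plan.

```python
import heapq
import random

random.seed(0)
# Invented rows: (final_date as an integer, log_number).
rows = [(random.randint(1, 10_000), f"log-{i}") for i in range(1_000)]
how_many = 10

# Sort the whole input, then keep the first n rows.
top_n_via_full_sort = sorted(rows)[:how_many]

# Keep only the n smallest rows while scanning, then return them in order.
top_n_via_heap = heapq.nsmallest(how_many, rows)

print(top_n_via_heap == top_n_via_full_sort)
```

Both give the same rows, but the second never has to order the full input, which is the kind of saving a row limit can unlock when nothing like a GROUP BY forces the full set to be materialised first.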

SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application and it has a stored procedure that contains the following, but it hits timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER( PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query. The compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2 you need an index on BItems.ICID.
The query could also, probably, be expressed as a correlated APPLY, because that seems to be what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the OP's query would return multiple rows on a StatusTime collision. I am only guessing that this is what is desired ('the most recent BData for this BItem').
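The tie behaviour is easy to see on a few invented rows. SQLite has no TOP or CROSS APPLY, so the sketch below (Python's sqlite3 module, assuming SQLite 3.25 or later for window functions) uses ROW_NUMBER() as a stand-in for "one row per BID" and compares it with the MAX() join from the question. The Reading column is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BData (BID INTEGER, StatusTime TEXT, Reading TEXT);
    -- BID 1 has two rows sharing the latest StatusTime.
    INSERT INTO BData VALUES
        (1, '2024-01-01 10:00', 'a'),
        (1, '2024-01-01 10:00', 'b'),
        (1, '2024-01-01 09:00', 'c'),
        (2, '2024-01-01 08:00', 'd');
""")

# MAX() join, as in the question: every row that ties for the latest time.
max_join = conn.execute("""
    SELECT d.BID, d.Reading
      FROM BData d
      JOIN (SELECT BID, MAX(StatusTime) AS MaxDate
              FROM BData
             GROUP BY BID) m
        ON d.BID = m.BID AND d.StatusTime = m.MaxDate
     ORDER BY d.BID, d.Reading
""").fetchall()

# One row per BID: which of the tied rows is kept is not defined.
one_per_bid = conn.execute("""
    SELECT BID, Reading
      FROM (SELECT BID, Reading,
                   ROW_NUMBER() OVER (PARTITION BY BID
                                      ORDER BY StatusTime DESC) AS rn
              FROM BData)
     WHERE rn = 1
     ORDER BY BID
""").fetchall()

print(max_join)
print(one_per_bid)
```

If ties are possible and a specific row must win, adding a tiebreaker column to the ORDER BY would make the single-row version deterministic.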
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick : avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue where a simple query involving max() is taking more than 1 hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the WHERE clause condition is fetched. In your case, that means every record in your table needs to be fetched before max() can be computed. Also, indexing BData.StatusTime will not speed up the query. Indexing is useful for looking up a particular record, but it will not help with performing comparisons.
In my case, I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause with SELECT TOP 1. The query went from over 1 hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully, your query can speed up.
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*, row_number() over (partition by bid order by statustime desc) as seqnum
from BData bd
) bd INNER JOIN
BItems b
ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE B.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID, with INCLUDE containing the StatusTime, and if possible filtering that by InternalID's matching BItems.ICID = 2.
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so I have given up trying for now.
It looks like the best solution is to rewrite the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
Thanks again for the suggestions.
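For anyone curious what the "separate table of latest readings" idea might look like, here is a minimal sketch using Python's sqlite3 module rather than SQL Server. The table LatestBData, the Reading column and the record_reading helper are all hypothetical, and the ON CONFLICT upsert syntax is SQLite's (3.24 or later); on SQL Server the same effect would need a different statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical side table: exactly one row per BID, holding the newest reading.
conn.execute("""
    CREATE TABLE LatestBData (
        BID        INTEGER PRIMARY KEY,
        StatusTime TEXT,
        Reading    TEXT
    )
""")

def record_reading(bid, status_time, reading):
    """Insert the reading, or overwrite the stored one only if this one is newer."""
    conn.execute("""
        INSERT INTO LatestBData (BID, StatusTime, Reading)
        VALUES (?, ?, ?)
        ON CONFLICT(BID) DO UPDATE SET
            StatusTime = excluded.StatusTime,
            Reading    = excluded.Reading
        WHERE excluded.StatusTime > LatestBData.StatusTime
    """, (bid, status_time, reading))

record_reading(1, '2024-01-01 09:00', 'c')
record_reading(1, '2024-01-01 10:00', 'a')
record_reading(1, '2024-01-01 08:00', 'stale')  # older than what is stored, ignored
record_reading(2, '2024-01-01 08:00', 'd')

latest = conn.execute(
    "SELECT BID, StatusTime, Reading FROM LatestBData ORDER BY BID"
).fetchall()
print(latest)
```

Reading the latest values then becomes a plain SELECT on a table with one row per BID, at the cost of an extra write for every new reading.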

How to avoid nested SQL query in this case?

I have an SQL question, related to this and this question (but different). Basically I want to know how I can avoid a nested query.
Let's say I have a huge table of jobs (jobs) executed by a company in their history. These jobs are characterized by year, month, location and the code belonging to the tool used for the job. Additionally I have a table of tools (tools), translating tool codes to tool descriptions and further data about the tool. Now they want a website where they can select year, month, location and tool using a dropdown box, after which the matching jobs will be displayed. I want to fill the last dropdown with only the relevant tools matching the preceding selection of year, month and location, so I wrote the following nested query:
SELECT c.tool_code, t.tool_description
FROM (
SELECT DISTINCT j.tool_code
FROM jobs AS j
WHERE j.year = ....
AND j.month = ....
AND j.location = ....
) AS c
LEFT JOIN tools as t
ON c.tool_code = t.tool_code
ORDER BY c.tool_code ASC
I resorted to this nested query because it was much faster than performing a JOIN on the complete database and selecting from that. It got my query time down a lot. But as I have recently read that MySQL nested queries should be avoided at all cost, I am wondering whether I am wrong in this approach. Should I rewrite my query differently? And how?
No, you shouldn't, your query is fine.
Just create an index on jobs (year, month, location, tool_code) and tools (tool_code) so that the INDEX FOR GROUP-BY can be used.
The article you provided describes subquery predicates (IN (SELECT ...)), not nested queries (SELECT FROM (SELECT ...)).
Even with the subqueries, the article is wrong: while MySQL is not able to optimize all subqueries, it deals with IN (SELECT …) predicates just fine.
I don't know why the author chose to put DISTINCT here:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT DISTINCT widgetId
FROM widgetOrders
)
and why do they think this will help to improve performance, but given that widgetID is indexed, MySQL will just transform this query:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT widgetId
FROM widgetOrders
)
into an index_subquery
Essentially, this is just like EXISTS clause: the inner subquery will be executed once per widgets row with the additional predicate added:
SELECT NULL
FROM widgetOrders
WHERE widgetId = widgets.id
and stop on the first match in widgetOrders.
This query:
SELECT DISTINCT w.id,w.name,w.price
FROM widgets w
INNER JOIN
widgetOrders o
ON w.id = o.widgetId
will have to use temporary to get rid of the duplicates and will be much slower.
You could avoid the subquery by using GROUP BY, but if the subquery performs better, keep it.
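To compare the two shapes side by side, the sketch below runs the nested query and a GROUP BY version on a few invented rows. It uses Python's sqlite3 module rather than MySQL, so it only checks that both return the same rows; it says nothing about which one MySQL executes faster. The index name idx_jobs_lookup is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs  (year INTEGER, month INTEGER, location TEXT, tool_code TEXT);
    CREATE TABLE tools (tool_code TEXT PRIMARY KEY, tool_description TEXT);
    CREATE INDEX idx_jobs_lookup ON jobs (year, month, location, tool_code);
    INSERT INTO tools VALUES ('T1', 'Hammer'), ('T2', 'Drill'), ('T3', 'Saw');
    -- T9 is used by a job but has no entry in tools.
    INSERT INTO jobs VALUES
        (2020, 1, 'NY', 'T1'),
        (2020, 1, 'NY', 'T1'),
        (2020, 1, 'NY', 'T2'),
        (2020, 2, 'NY', 'T3'),
        (2020, 1, 'NY', 'T9');
""")

params = (2020, 1, 'NY')

nested = conn.execute("""
    SELECT c.tool_code, t.tool_description
      FROM (SELECT DISTINCT j.tool_code
              FROM jobs AS j
             WHERE j.year = ? AND j.month = ? AND j.location = ?) AS c
      LEFT JOIN tools AS t ON c.tool_code = t.tool_code
     ORDER BY c.tool_code ASC
""", params).fetchall()

grouped = conn.execute("""
    SELECT j.tool_code, t.tool_description
      FROM jobs AS j
      LEFT JOIN tools AS t ON j.tool_code = t.tool_code
     WHERE j.year = ? AND j.month = ? AND j.location = ?
     GROUP BY j.tool_code, t.tool_description
     ORDER BY j.tool_code ASC
""", params).fetchall()

print(nested)
print(grouped)
```

The T9 row also shows what the LEFT JOIN buys: a tool code with no description still appears in the dropdown list, with an empty description.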
Why do you use a LEFT JOIN instead of a JOIN to join tools?