I received reports that my report-generating application was not working. After my initial investigation, I found that the SQL transaction was timing out. I'm mystified as to why the query for a smaller selection of items would take so much longer to return results.
Quick query (averages 4 seconds to return):
SELECT * FROM Payroll WHERE LINEDATE >= '04-17-2010' AND LINEDATE <= '04-24-2010' ORDER BY EMPLYEE_NUM ASC, OP_CODE ASC, LINEDATE ASC
Long query (averages 1 minute 20 seconds to return):
SELECT * FROM Payroll WHERE LINEDATE >= '04-18-2010' AND LINEDATE <= '04-24-2010' ORDER BY EMPLYEE_NUM ASC, OP_CODE ASC, LINEDATE ASC
I could simply increase the timeout on the SqlCommand, but that doesn't change the fact that the query is taking longer than it should.
Why would requesting a subset of the items take longer than the query that returns more data? How can I optimize this query?
Most probably, the longer range causes the optimizer to select a full table scan with a sort instead of an index scan, and the full scan turns out to be faster.
The index traversal can take longer than a table scan for numerous reasons.
For instance, it is probable that the table itself fits completely into the cache but not the table and the index at the same time.
Aside from the possibility that there is no index on the LINEDATE column at all, this depends on the database server you use; some maintain statistics which influence the query plan chosen to optimize access.
For Informix, e.g., you would run an UPDATE STATISTICS statement, and you could examine the costs in a query plan using SET EXPLAIN ON.
You should check your documentation for similar statements.
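As a hedged illustration, the analogous statements in SQL Server (which the SqlCommand in the question suggests) would look something like this; the date literals and column names are taken from the queries above:
-- Refresh the statistics the optimizer uses for the Payroll table
UPDATE STATISTICS Payroll;
GO
-- Show the estimated plan and its costs instead of executing the query
SET SHOWPLAN_ALL ON;
GO
SELECT * FROM Payroll
WHERE LINEDATE >= '04-18-2010' AND LINEDATE <= '04-24-2010'
ORDER BY EMPLYEE_NUM ASC, OP_CODE ASC, LINEDATE ASC;
GO
SET SHOWPLAN_ALL OFF;
GO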
I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I've read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan:
Stats are up to date, we regather once a week.
Query taking 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using the index now:
The query runs faster, but not 'fast', completing in about 6 minutes.
UPDATE 2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder why, or whether a function-based index would also have been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
Materialized view, possibly for the prior 11 months together with a live query against OE_ORDER_LINES_ALL for the current month-to-date (a rough sketch follows this list).
A covering index (see below).
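For the materialized-view option, a rough sketch might look like the following; the view name is made up, and the refresh strategy and the month-to-date split against the live table are left out:
CREATE MATERIALIZED VIEW mv_shipped_by_item_day
REFRESH COMPLETE ON DEMAND
AS
SELECT ool.inventory_item_id,
       TRUNC(ool.actual_shipment_date) AS ship_date,
       SUM(NVL(ool.shipped_quantity, 0)) AS shipped_quantity
FROM   oe_order_lines_all ool
WHERE  ool.cancelled_flag = 'N'
AND    ool.fulfilled_flag = 'Y'
GROUP  BY ool.inventory_item_id, TRUNC(ool.actual_shipment_date);
The report query would then sum rows with ship_date >= TRUNC(SYSDATE) - 365 from the materialized view instead of scanning the full table.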
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can add the additional columns inventory_item_id and shipped_quantity to the index as well.
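A sketch of both variants; the index names are illustrative, and you would create one or the other, since the covering version makes the first redundant:
-- Index to identify the matching rows
CREATE INDEX oe_lines_flags_shipdate_idx
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date);
-- Extended ("covering") version that also carries the selected columns
CREATE INDEX oe_lines_flags_shipdate_cov_idx
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date,
                           inventory_item_id, shipped_quantity);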
Let's recapitulate the facts:
a) You access about 300K rows from your table (see the cardinality in the 3rd line of the execution plan)
b) you use a FULL TABLE SCAN to get the data
c) the query is very slow
The first thing is to check why the FULL TABLE SCAN is so slow - if the table is extremely large (check the BYTES in user_segments), you need to optimize the access to your data.
But remember, no index will help you get 300K rows from, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not used much and a large part of it is on disk.
What you need is partitioning - in your case, range partitioning on actual_shipment_date - for your data size, on a monthly or yearly basis.
This eliminates the need to scan the old data (partition pruning) and makes the query much more efficient.
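A rough sketch of yearly range partitioning, assuming the table can be rebuilt; the partition boundaries are made up and would need to match your actual data range:
CREATE TABLE oe_order_lines_all_part
PARTITION BY RANGE (actual_shipment_date)
(
    PARTITION p_2014    VALUES LESS THAN (DATE '2015-01-01'),
    PARTITION p_2015    VALUES LESS THAN (DATE '2016-01-01'),
    PARTITION p_current VALUES LESS THAN (MAXVALUE)
)
AS
SELECT * FROM oe_order_lines_all;
With a predicate on actual_shipment_date, Oracle then only scans the partitions that can contain matching rows (partition pruning).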
Another possibility - if the number of rows is small but the table segment is very large - is to reorganize the table to get a better full-scan time.
A possible explanation is given in the comments.
In SQL Server 2014 Enterprise Edition (64-bit), I am trying to read from a view. A standard query contains just an ORDER BY and OFFSET-FETCH clause, like this.
Approach 1
SELECT
*
FROM Metadata
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
OFFSET 150000 ROWS
FETCH NEXT 40 ROWS ONLY
However, this fairly simple query performs almost 9 times slower (noticeable when skipping a large number of rows, like 150k) than the following query, which returns the same result.
In this case I am reading the primary key first and then using the result in a WHERE ... IN clause.
Approach 2
SELECT
*
FROM Metadata
WHERE NewsId IN (
SELECT
NewsId
FROM Metadata
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
OFFSET 150000 ROWS
FETCH NEXT 40 ROWS ONLY
)
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
Benchmarking these two shows this difference:
(40 row(s) affected)
SQL Server Execution Times:
CPU time = 14748 ms, elapsed time = 3329 ms.
(40 row(s) affected)
SQL Server Execution Times:
CPU time = 3828 ms, elapsed time = 469 ms.
I have indexes on the primary key and PublishDate, and their fragmentation is very low. I have also tried to run similar queries against the database table, but in every case the second approach yields great performance gains. I have also tested this on SQL Server 2012.
Can someone explain what is going on?
Schema
Approach 1: Execution Plan
Approach 2: Execution Plan (Left part)
Approach 2: Execution Plan (Right part)
For differently structured queries, even ones producing the same result set, you get different query plans with different approaches and query costs. That is common across a variety of SQL RDBMS implementations.
Basically, in the sample above, when selecting a small part of the data from a large table, it is a good approach to first reduce and minimize the number of rows in the result and only then fetch the full rows with all their columns, just like your second query does.
Another approach is to build a proper index for reducing the result set in the first step. For the query above, an index on exactly the columns of the ORDER BY clause, in the same column order and sort direction, could be a solution.
(You didn't post the structure of the indexes mentioned in the query plans, so I can only guess what is hidden behind their names.)
You can also use index hints to direct the optimizer to the specific index you consider best for the task, in case the optimizer doesn't do the job itself.
When you execute a query, the engine looks for an index that can be used to get the best performance. Your approach 1 is using an index which doesn't include all the columns in the SELECT statement; this causes the Key Lookup in the query plan, and in my experience this always performs worse than selecting only indexed columns.
You can see the difference if you create an index on AgeInHours, RankingPoint, PublishDate and INCLUDE all the remaining columns (recommended only for testing purposes).
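A hedged sketch of such an index; the index name is made up, and it assumes the index can be created on Metadata (or on the view's underlying table), with INCLUDE listing whatever columns the SELECT actually needs:
CREATE NONCLUSTERED INDEX IX_Metadata_Age_Rank_Publish
    ON Metadata (AgeInHours ASC, RankingPoint DESC, PublishDate DESC)
    INCLUDE (NewsId);  -- plus the other selected columns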
For your second approach you can get even better performance if you use a CTE and then make a JOIN instead of WHERE with IN, or a temp table with an index if you have millions of rows.
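A minimal sketch of the CTE + JOIN variant, assuming NewsId is the key as in the question:
WITH Keys AS (
    SELECT NewsId
    FROM Metadata
    ORDER BY AgeInHours ASC, RankingPoint DESC, PublishDate DESC
    OFFSET 150000 ROWS FETCH NEXT 40 ROWS ONLY
)
SELECT m.*
FROM Keys k
INNER JOIN Metadata m ON m.NewsId = k.NewsId
ORDER BY m.AgeInHours ASC, m.RankingPoint DESC, m.PublishDate DESC;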
SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical, the only difference being that, for the second query, what is a RID Lookup in SQL Fiddle is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to help determine why it is using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but I would like to have an understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make an index access more efficient, but this should be done with care, as it can adversely affect other queries as well as possibly degrading insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency appropriate to the rate of change of the frequency distribution in the table data.
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint: if you know the index that your plan should be using (one containing the Id and ForeignId columns), then stick that in your query as a hint and force the optimiser to use that index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
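A sketch of such a hint; the index name ix_SupportContacts_ForeignId_Id is hypothetical and stands for whatever index actually contains those columns:
-- Force the optimizer to use the named index via a table hint
SELECT sc.Id, sc.ForeignId, sc.Details
FROM SupportContacts sc WITH (INDEX (ix_SupportContacts_ForeignId_Id))
WHERE sc.ForeignId IN (1, 2, 3, 5);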
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);
Can this query below be optimized?
select
max(date), sysdate - max(date)
from
table;
Query execution time ~5.7 seconds
I have another approach
select
date, sysdate - date
from
(select * from table order by date desc)
where
rownum = 1;
Query execution ~7.9 seconds
In this particular case, table has around 17,000,000 entries.
Is there a more optimal way to rewrite this?
Update: Well, I tried what a few of you suggested in a development database, although with a smaller subset than the original (approximately 1,000,000 records). Without the index, the queries run slower than with the index.
The first query, without index: ~0.56 secs; with index: ~0.2 secs. The second query, without index: ~0.41 secs; with index: ~0.005 secs. (This surprised me; I thought the first query would run faster than the second. Maybe it's more suitable for a smaller set of records.)
I suggested this solution to the DBA, and he will change the table structure to accommodate it; then I will test it with the actual data. Thanks.
Is there an index on the date column?
That query is simple enough that there's likely nothing that can be done to optimize it beyond adding an index on the date column. What database is this? And is sysdate another column of the table?
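If such an index is missing, a minimal sketch (the names below are placeholders, since the question anonymizes the table and column as table and date):
-- An index on the date column lets Oracle answer MAX(...) with a MIN/MAX index scan
CREATE INDEX my_table_date_idx ON my_table (my_date);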
I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table valued functions can have a non-linear execution time.
Let's consider function equivalent for this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from mo it should sum all the preceding values which requires rewinding the rowsource.
Time taken to calculate SUM for each row increases as the row numbers increase.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query you will see that it completes in much less than 1/20 of the time required for the full table.
Most probably, something similar happens in your case too.
To say something more definitely, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)
I had the same problem: a simple query joining five tables and returning 1000 rows took two minutes to complete. When I added "TOP 10000" to it, it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
Your TOP has no ORDER BY, so it's simply the same as SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and that would take a lot longer.
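As a rough illustration of that equivalence (SET ROWCOUNT is an older mechanism, deprecated for limiting SELECTs in newer SQL Server versions; the parameter type is assumed):
DECLARE @parameter int;  -- type assumed for illustration
SET ROWCOUNT 6000;       -- same effect as TOP 6000 when there is no ORDER BY
SELECT col1, col2, col3
FROM dbo.some_table_function()
WHERE col1 = @parameter;
SET ROWCOUNT 0;          -- reset to unlimited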
If dbo.some_table_function is an inline table-valued udf, then it's simply a macro that's expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the udf is a multi-statement table-valued function, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening, though.
Not directly related, but another SO question on TVFs
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case, the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Server Management Studio, and it'll tell you EXACTLY what operations are being performed and how long each is predicted to take.
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimizer will choose whether or not to use an index. Perhaps your "TOP" is forcing it to use the index.
If you are using MSSQL Query Analyzer (the name escapes me), hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the IO/CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
go
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.