In SQL Server 2014 Enterprise Edition (64-bit), I am trying to read from a view. A standard query contains just an ORDER BY and an OFFSET-FETCH clause, like this.
Approach 1
SELECT
*
FROM Metadata
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
OFFSET 150000 ROWS
FETCH NEXT 40 ROWS ONLY
However, this fairly simple query performs almost 9 times slower (noticeable when skipping a large number of rows, such as 150k) than the following query, which returns the same result.
In this case I am reading the primary key first and then using it in a WHERE ... IN clause.
Approach 2
SELECT
*
FROM Metadata
WHERE NewsId IN (
SELECT
NewsId
FROM Metadata
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
OFFSET 150000 ROWS
FETCH NEXT 40 ROWS ONLY
)
ORDER BY
AgeInHours ASC,
RankingPoint DESC,
PublishDate DESC
Benchmarking the two shows this difference:
Approach 1:
(40 row(s) affected)
SQL Server Execution Times:
CPU time = 14748 ms, elapsed time = 3329 ms.
Approach 2:
(40 row(s) affected)
SQL Server Execution Times:
CPU time = 3828 ms, elapsed time = 469 ms.
I have indexes on the primary key and PublishDate, and their fragmentation is very low. I have also run similar queries against the underlying database table, but in every case the second approach yields large performance gains. I have also tested this on SQL Server 2012.
Can someone explain what is going on?
Schema
Approach 1: Execution Plan
Approach 2: Execution Plan (Left part)
Approach 2: Execution Plan (Right part)
Differently structured queries, even when they return the same result set, get different query plans with different costs. That is common across SQL RDBMS implementations.
Basically, in the sample above, when selecting a small part of the data from a large table, it is a good approach to first reduce the number of rows in the result and only then fetch the full rows with all their columns, exactly as your second query does.
Another approach is to build an index specifically aimed at reducing the result set in that first step. For the query above, an index on the ORDER BY columns, in the same order and sort direction, could be a solution.
(You didn't send the structure of the indexes mentioned in the query plans, so I can only guess what is hidden behind their names.)
You can also use index hints to direct the SQL optimizer to the specific index you consider best for the task, in case the optimizer doesn't do the job on its own.
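For example, a table hint like the following forces a particular index (IX_Metadata_Ordering is a hypothetical name here; substitute an index, on the underlying table, that actually exists):

SELECT *
FROM Metadata WITH (INDEX (IX_Metadata_Ordering))
ORDER BY
    AgeInHours ASC,
    RankingPoint DESC,
    PublishDate DESC
OFFSET 150000 ROWS
FETCH NEXT 40 ROWS ONLY;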
When you execute a query, the engine looks for an index it can use to get the best performance. Your approach 1 uses an index that doesn't include all the columns in the SELECT statement; this causes the Key Lookup in the query plan, and in my experience that always performs worse than selecting only indexed columns.
You can see the difference if you create an index on AgeInHours, RankingPoint, PublishDate and INCLUDE all the other columns (recommended only for testing purposes).
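A rough sketch of such a covering index, assuming Metadata is a table (or applying it to the view's underlying table); the INCLUDE list here is illustrative, so name the columns your SELECT actually needs:

CREATE NONCLUSTERED INDEX IX_Metadata_Covering
    ON Metadata (AgeInHours ASC, RankingPoint DESC, PublishDate DESC)
    INCLUDE (NewsId, Title, Body);  -- Title and Body are hypothetical column names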
For your second approach you can get even better performance if you use a CTE and then JOIN instead of WHERE ... IN, or a temp table with an index if you have millions of rows.
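A sketch of the CTE + JOIN variant, using the same columns as in the question (untested against your schema):

WITH Keys AS (
    SELECT NewsId
    FROM Metadata
    ORDER BY AgeInHours ASC, RankingPoint DESC, PublishDate DESC
    OFFSET 150000 ROWS FETCH NEXT 40 ROWS ONLY
)
SELECT m.*
FROM Keys AS k
INNER JOIN Metadata AS m ON m.NewsId = k.NewsId
ORDER BY m.AgeInHours ASC, m.RankingPoint DESC, m.PublishDate DESC;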
Related
I have a very simple query on a table with 60 million rows:
select id, max(version) from mytable group by id
It returns 6 million records and takes more than one hour to run. I just need to run it once because I am transferring the records to another new table that I keep updated.
I tried a few things that didn't work for me but that are often suggested here on Stack Overflow:
inner query with select top 1 / order by desc: it is not supported in Sybase ASE
left outer join where a.version < b.version and b.version is null: I interrupted the query after more than one hour and barely a hundred thousand records were found
I understand that Sybase has to do a full scan.
Why could the full scan be so slow?
Is the slowness due to the Sybase ASE instance itself or specific to the query?
What are my options to reduce the running time of the query?
I am not intimately familiar with Sybase optimization. However, your query is really slow. Here are two ideas.
First, add an index on mytable(id, version desc). At a minimum, this is a covering index for the query, meaning that all columns used are in the index. Sybase is probably smart enough to eliminate the group by.
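A minimal sketch of that index in Sybase ASE syntax (verify that your version supports a descending column in the index definition):

create index ix_mytable_id_version
    on mytable (id, version desc)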
Another option uses the same index, but with a correlated subquery:
select t.id
from mytable t
where t.version = (select max(t2.version)
from mytable t2
where t2.id = t.id
);
This would be a full table scan (a little expensive but not an hour's worth) and an index lookup on each row (pretty cheap). The advantage of this approach is that you can select all the columns you want. The disadvantage is that if two rows have the same maximum version for an id, you will get both in the result set.
Edit: Here, Nicolas, is a more precise answer. I have no particular experience with Sybase, but I have gained experience working with tons of data on a fairly small SQL Server machine. From that experience I learned that when you work with a large amount of data and your server doesn't have enough memory to handle it, you will hit bottlenecks (I suspect the time goes into writing temporary results to disk). I think that's your case (60 million rows), but once again, I don't know Sybase and it depends on many factors, such as how many columns mytable has, how much RAM your server has, and so on.
Here are the results of a small experiment I just ran:
I ran these two queries on SQL Server and PostgreSQL.
Query 1 :
SELECT id, max(version)
FROM mytable
GROUP BY id
Query 2 :
SELECT id, version
FROM
(
SELECT id, version, ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) as RN
FROM mytable
) q
WHERE q.rn = 1
On PostgreSQL, mytable has 2,878,441 rows.
Query #1 takes 31.458 sec and returns 1,200,146 rows.
Query #2 takes 41.787 sec and returns 1,200,146 rows.
On SQL Server, mytable has 1,600,010 rows.
Query #1 takes 6 sec and returns 537,232 rows.
Query #2 takes 10 sec and returns 537,232 rows.
So far, your query is always faster. So I tried with bigger tables.
On PostgreSQL, mytable now has 5,875,134 rows.
Query #1 takes 100.915 sec and returns 2,796,800 rows.
Query #2 takes 98.805 sec and returns 2,796,800 rows.
On SQL Server, mytable now has 11,712,606 rows.
Query #1 takes 28 min 28 sec and returns 6,262,778 rows.
Query #2 takes 2 min 39 sec and returns 6,262,778 rows.
Now we can make an assumption. In the first part of this experiment, both servers have enough memory to deal with the data, so GROUP BY is faster. The second part of the experiment suggests that too much data kills the performance of GROUP BY. To work around that bottleneck, ROW_NUMBER() seems to do the trick.
Caveats: I don't have a bigger table on PostgreSQL, nor do I have a Sybase server at hand.
For this experiment I was using PostgreSQL 9.3.5 on x86_64 and SQL Server 2012 - 11.0.2100.60 (X64).
Maybe this experiment will help you, Nicolas.
So finally the nonclustered index on (id, version desc) did the trick without having to change anything in the query. Index creation also takes one hour, but the query then responds in a few seconds. I guess that's still better than maintaining another table, which could cause data integrity issues.
The max() function does not help the optimizer to use the index.
Perhaps you should create a function-based index on max(version):
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1550/html/sqlug/CHDDHJIB.htm
I'm looking to improve the performance of one of our stored procedures and have come across something that I'm struggling to find information on. I'm by no means a DBA, so my knowledge of SQL is not brilliant.
Here's a simplified version of the problem:
If I use the following query -
SELECT * FROM Product
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
I get the results in around 20ms
However if I apply a conditional ordering -
DECLARE @so int = 1
SELECT * FROM Product
ORDER BY
CASE WHEN @so = 1 THEN Name END,
CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
The overall request in my mind is the same, but the results take 600ms, 30x longer.
The execution plans are drastically different, but being a novice I've no idea how to bring the execution plan for the second case into line with the first.
Is this even possible, or should I look at creating separate procedures for the order by cases and move choosing the order logic to the code?
NB. This is using MS SQL Server
The reason is that SQL Server can no longer use an index. One solution is dynamic SQL. Another is a simple IF:
IF (@so = 1)
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
END;
ELSE
BEGIN
SELECT p.*
FROM Product p
ORDER BY Name DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY;
END;
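The dynamic SQL alternative mentioned above could look roughly like this (a sketch; it assumes @so is a parameter of the procedure):

DECLARE @sql nvarchar(max) =
    N'SELECT p.* FROM Product p ORDER BY Name '
    + CASE WHEN @so = 2 THEN N'DESC' ELSE N'ASC' END
    + N' OFFSET 100 ROWS FETCH NEXT 28 ROWS ONLY;';

EXEC sys.sp_executesql @sql;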
Gordon Linoff is right that this prevents an index from being used, but to expand a bit on that:
When SQL Server prepares execution of a query, it generates an execution plan; that is, the query is compiled into steps that the database engine can execute. It's at this point, generally, that it looks at which indexes are available for use, but the parameter values are not yet known at this point, so the query optimiser cannot see whether an index on Name would be useful.
The workarounds in his answer are valid, but I'd like to offer one more:
Add OPTION (RECOMPILE) to your query. This forces your query execution plan to be recompiled each time, and each time, the parameter values are known, and allows the optimiser to optimise for those specific parameter values. It will generally be a bit less efficient than fully dynamic SQL, since dynamic SQL allows each possible statement's execution plan to be cached, but it will likely be better than what you have now, and more maintainable than the other options.
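Applied to the query from the question, it would look like this (a sketch):

DECLARE @so int = 1;

SELECT * FROM Product
ORDER BY
    CASE WHEN @so = 1 THEN Name END,
    CASE WHEN @so = 2 THEN Name END DESC
OFFSET 100 ROWS
FETCH NEXT 28 ROWS ONLY
OPTION (RECOMPILE);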
SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an IN clause on an Id and then also select other columns, the IN is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical; the only difference is that for the second query, what is a RID Lookup in SQL Fiddle is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to help determine why it uses a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details, though, and I can experiment with that more, but I would like to have an understanding of what is going on before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID Lookup nor a table scan would be needed, since your query could be satisfied from the index itself.
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make an index access more efficient, but this should be done with care, as it can adversely affect other queries as well as degrade insert performance.
The other important thing you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency appropriate to the rate of change in the table data.
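On SQL Server, refreshing the statistics for the table in the question is a one-liner (a sketch; FULLSCAN is optional and can be slow on very large tables):

UPDATE STATISTICS SupportContacts WITH FULLSCAN;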
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint: if you know the index your plan should be using, which contains the Id and ForeignId columns, then put it in your query as a hint and force the SQL optimizer to use that index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
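For example, if an index named IX_SupportContacts_ForeignIdAsc_IdDesc exists (as suggested elsewhere on this page), the hint would look roughly like this:

SELECT [Id], [ForeignId], Details
FROM SupportContacts WITH (INDEX (IX_SupportContacts_ForeignIdAsc_IdDesc))
WHERE ForeignId IN (1, 2, 3, 5);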
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);
I received reports that my report-generating application was not working. After my initial investigation, I found that the SQL transaction was timing out. I'm mystified as to why the query for a smaller selection of items takes so much longer to return results.
Quick query (averages 4 seconds to return):
SELECT * FROM Payroll WHERE LINEDATE >= '04-17-2010' AND LINEDATE <= '04-24-2010' ORDER BY EMPLYEE_NUM ASC, OP_CODE ASC, LINEDATE ASC
Long query (averages 1 minute 20 seconds to return):
SELECT * FROM Payroll WHERE LINEDATE >= '04-18-2010' AND LINEDATE <= '04-24-2010' ORDER BY EMPLYEE_NUM ASC, OP_CODE ASC, LINEDATE ASC
I could simply increase the timeout on the SqlCommand, but it doesn't change the fact the query is taking longer than it should.
Why would requesting a subset of the items take longer than the query that returns more data? How can I optimize this query?
Most probably, the longer range makes the optimizer choose a full table scan with a sort instead of an index scan, and that full scan turns out to be faster.
The index traversal can take longer than a table scan for numerous reasons.
For instance, it is quite possible that the table itself fits completely into the cache, but the table and the index together do not.
That is in addition to the possibility that there is no index on the LINEDATE column at all.
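If LINEDATE is indeed not indexed, a simple index like this would be the first thing to try (a sketch; the table and column names are taken from the query above):

CREATE NONCLUSTERED INDEX IX_Payroll_LineDate
    ON Payroll (LINEDATE);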
This depends on the database server you use; some maintain statistics which influence the query plan to optimize access.
For Informix, e.g., you would run an UPDATE STATISTICS statement.
You could then examine the costs in a query plan using SET EXPLAIN ON.
You should check your documentation for similar statements.
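For Informix, for instance, a minimal sketch would be the following (table name taken from the question; check the exact options for your version):

UPDATE STATISTICS MEDIUM FOR TABLE payroll;
SET EXPLAIN ON;   -- subsequent queries write their plan to sqexplain.out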
I'm selecting some rows from a table valued function but have found an inexplicable massive performance difference by putting SELECT TOP in the query.
SELECT col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
is taking upwards of 5 or 6 mins to complete.
However
SELECT TOP 6000 col1, col2, col3 etc
FROM dbo.some_table_function
WHERE col1 = @parameter
--ORDER BY col1
completes in about 4 or 5 seconds.
This wouldn't surprise me if the returned set of data were huge, but the particular query involved returns ~5000 rows out of 200,000.
So in both cases, the whole of the table is processed, as SQL Server continues to the end in search of 6000 rows which it will never get to. Why the massive difference then? Is this something to do with the way SQL Server allocates space in anticipation of the result set size (the TOP 6000 thereby giving it a low requirement which is more easily allocated in memory)?
Has anyone else witnessed something like this?
Thanks
Table valued functions can have a non-linear execution time.
Let's consider a function equivalent of this query:
SELECT (
SELECT SUM(mi.value)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
mo.value
This query (that calculates the running SUM) is fast at the beginning and slow at the end, since on each row from mo it should sum all the preceding values which requires rewinding the rowsource.
Time taken to calculate SUM for each row increases as the row numbers increase.
If you make mytable large enough (say, 100,000 rows, as in your example) and run this query you will see that it takes considerable time.
However, if you apply TOP 5000 to this query you will see that it completes much faster than 1/20 of the time required for the full table.
Most probably, something similar happens in your case too.
To say something more definitely, I need to see the function definition.
Update:
SQL Server can push predicates into the function.
For instance, I just created this TVF:
CREATE FUNCTION fn_test()
RETURNS TABLE
AS
RETURN (
SELECT *
FROM master
);
These queries:
SELECT *
FROM fn_test()
WHERE name = @name
SELECT TOP 1000 *
FROM fn_test()
WHERE name = @name
yield different execution plans (the first one uses clustered scan, the second one uses an index seek with a TOP)
I had the same problem: a simple query joining five tables and returning 1000 rows took two minutes to complete. When I added TOP 10000 to it, it completed in less than one second. It turned out that the clustered index on one of the tables was heavily fragmented.
After rebuilding the index the query now completes in less than a second.
Your TOP has no ORDER BY, so it's simply the same as issuing SET ROWCOUNT 6000 first. An ORDER BY would require all rows to be evaluated first, and that would take a lot longer.
If dbo.some_table_function is an inline table-valued UDF, then it's simply a macro that's expanded, so it returns the first 6000 rows, as mentioned, in no particular order.
If the UDF is multi-statement, then it's a black box and will always pull in the full dataset before filtering. I don't think this is happening.
Not directly related, but another SO question on TVFs
You may be running into something as simple as caching here - perhaps (for whatever reason) the "TOP" query is cached? Using an index that the other isn't?
In any case the best way to quench your curiosity is to examine the full execution plan for both queries. You can do this right in SQL Management Console and it'll tell you EXACTLY what operations are being completed and how long each is predicted to take.
All SQL implementations are quirky in their own way - SQL Server's no exception. These kind of "whaaaaaa?!" moments are pretty common. ;^)
It's not necessarily true that the whole table is processed if col1 has an index.
The SQL optimizer will choose whether or not to use an index. Perhaps your TOP is forcing it to use the index.
If you are using the MSSQL Query Analyzer (the name escapes me), hit Ctrl-K. This will show the execution plan for the query instead of executing it. Mousing over the icons will show the I/O and CPU usage, I believe.
I bet one is using an index seek, while the other isn't.
If you have a generic client:
SET SHOWPLAN_ALL ON;
GO
select ...;
go
see http://msdn.microsoft.com/en-us/library/ms187735.aspx for details.
I think Quassnoi's suggestion seems very plausible. By adding TOP 6000 you are implicitly giving the optimizer a hint that a fairly small subset of the 200,000 rows is going to be returned. The optimizer then uses an index seek instead of a clustered index scan or table scan.
Another possible explanation could be caching, as Jim Davis suggests. This is fairly easy to rule out by running the queries again. Try running the one with TOP 6000 first.
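On a test box (never in production), you can also clear the caches between runs to rule caching out entirely (a sketch):

CHECKPOINT;              -- flush dirty pages so DROPCLEANBUFFERS removes everything
DBCC DROPCLEANBUFFERS;   -- empty the data (buffer) cache
DBCC FREEPROCCACHE;      -- empty the plan cache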