Pagination in SQL - Performance issue - sql

I am trying to implement pagination and found this answer on SO:
https://stackoverflow.com/a/109290/1481690
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
) AS RowConstrainedResult
WHERE RowNum >= 1
AND RowNum < 20
ORDER BY RowNum
I am using exactly this query, with a few additional table joins in the inner query.
I am seeing performance problems in the following scenarios:
WHERE RowNum >= 1
AND RowNum < 20 ==>executes faster approx 2 sec
WHERE RowNum >= 1000
AND RowNum < 1010 ==> more time approx 10 sec
WHERE RowNum >= 30000
AND RowNum < 30010 ==> more time approx 17 sec
Every time I select 10 rows, but the time difference is huge. Any ideas or suggestions?
I chose this approach because I bind columns dynamically and build the query as a string. Is there a better way to organize the pagination query in SQL Server 2008?
Is there a way I can improve the performance of this query?
Thanks

I always check how much data a query touches and try to eliminate unnecessary columns as well as rows.
These are obvious points that you may have checked already, but I want to mention them in case you have not.
Your query may be slow because of the "SELECT *". Selecting every column makes it hard for the optimizer to use a narrow covering index, so it tends to end up with a more expensive execution plan.
Check whether you need only some of the columns, and make sure there is a suitable covering index on the Orders table.
Because SQL Server 2008 has no explicit SKIP or OFFSET clause, we have to build one ourselves, which can be done with an INNER JOIN.
The first query generates the row number from OrderDate and selects nothing else.
The second query does the same, but also selects the other columns of interest from Orders (or all of them if you need every column).
We then JOIN the two result sets on row number and OrderDate, and apply the SKIP filter to the first query, where the data set is as small as it can be.
Try this code.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, OrderDate
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q2
ON q1.RowNum=q2.RowNum AND q1.OrderDate=q2.OrderDate AND q1.rownum BETWEEN 30000 AND 30020
To give you an estimate: I tried this with the following test data, and no matter which window I query, the results come back in less than 2 seconds. Note that the table is a HEAP (no index) and has 2M rows in total. The test select queries 10 rows, from 50,000 to 50,010.
The insert below took around 8 minutes.
IF object_id('TestSelect','u') IS NOT NULL
DROP TABLE TestSelect
GO
CREATE TABLE TestSelect
(
OrderDate DATETIME2(2)
)
GO
DECLARE @i bigint=1, @dt DATETIME2(2)='01/01/1700'
WHILE @i<=2000000
BEGIN
    IF @i%15 = 0
        SELECT @dt = DATEADD(DAY,1,@dt)
    INSERT INTO dbo.TestSelect( OrderDate )
    SELECT @dt
    SELECT @i=@i+1
END
Selecting the window 50,000 to 50,010 took less than 3 seconds.
Selecting the last single row 2,000,000 to 2,000,000 also took 3 seconds.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,OrderDate
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,*
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q2
ON q1.RowNum=q2.RowNum
AND q1.OrderDate=q2.OrderDate
AND q1.RowNum BETWEEN 50000 AND 50010

ROW_NUMBER is a poor way of doing pagination, because the cost of the operation grows considerably with the row number.
Instead you should use a double ORDER BY clause.
Say you want the records whose ROW_NUMBER is between 1200 and 1210. Instead of using ROW_NUMBER() OVER (...) and then filtering on the result in the WHERE clause, you can write:
SELECT TOP(11) *
FROM (
SELECT TOP(1210) *
FROM [...]
ORDER BY something ASC
) subQuery
ORDER BY something DESC
Note that this query returns the page in reverse order. That is generally not an issue, as it is easy to reverse the set in the UI layer (e.g. in C#), especially since the resulting set should be relatively small.
The latter is generally a lot faster. It also benefits greatly from a clustered index (CREATE CLUSTERED INDEX ...) on the column you sort by.
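A minimal sketch of the double ORDER BY idea, run against SQLite through Python's sqlite3 module only so that it can be executed anywhere. LIMIT stands in for TOP, and the items table, its columns and the helper name are made up for illustration. Note the assumption that the table holds at least last_row rows; on a short final page the inner LIMIT returns fewer rows and the outer LIMIT would pick up rows from the previous page.

```python
import sqlite3


def page_by_double_order(conn, first_row, last_row):
    """Return rows first_row..last_row (1-based, inclusive), ordered by id ascending."""
    page_len = last_row - first_row + 1
    rows = conn.execute(
        """
        SELECT id, label
        FROM (SELECT id, label FROM items ORDER BY id ASC LIMIT ?) AS sub
        ORDER BY id DESC
        LIMIT ?
        """,
        (last_row, page_len),
    ).fetchall()
    rows.reverse()  # the query returns the page backwards; flip it client-side
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 2001)],
)

page = page_by_double_order(conn, 1200, 1210)
print(page[0], page[-1])
```

The inner query still has to walk the first 1210 rows in sort order, so this reduces sorting and row-numbering work rather than removing the dependence on the page position.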
Hope that helps.

Even though you always select the same number of rows, performance degrades as you move towards the end of your data window. To get the first 10 rows, the engine fetches just 10 rows; to get the next 10, it has to fetch 20, discard the first 10, and return 10. To get rows 30000 to 30010, it has to read all 30010, skip the first 30k, and return 10.
Some tricks to improve performance (not a full list; building an OLAP solution is skipped completely):
You mentioned joins; if possible, do not join inside the inner query but join against its result. You can also add some logic to ORDER BY OrderDate and choose ASC or DESC depending on which bucket you are retrieving. Say you want to grab the "last" 10: ORDER BY ... DESC will work much faster. Needless to say, there has to be an index on OrderDate.
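The cost argument above can be modelled with a few lines of Python. This is only a toy model, not a measurement of SQL Server: a generator plays the role of the sorted scan, and a counter records how many rows it had to produce before the page could be returned. All names are hypothetical.

```python
from itertools import islice


def counting_scan(n_rows, counter):
    """Yield row numbers 1..n_rows in sorted order, counting every row produced."""
    for row in range(1, n_rows + 1):
        counter["rows_read"] += 1
        yield row


def offset_page(n_rows, first_row, page_size):
    """Emulate offset paging: skip first_row - 1 rows, then keep page_size rows."""
    counter = {"rows_read": 0}
    scan = counting_scan(n_rows, counter)
    page = list(islice(scan, first_row - 1, first_row - 1 + page_size))
    return page, counter["rows_read"]


for first_row in (1, 1000, 30000):
    page, rows_read = offset_page(100_000, first_row, 10)
    print(f"rows {page[0]}..{page[-1]}: {rows_read} rows read")
```

Every call returns 10 rows, but the number of rows read grows with the starting row, which matches the timings reported in the question.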

Incredibly, no other answer has mentioned the fastest way to do paging in all SQL Server versions, specifically with respect to the OP's question where offsets can be terribly slow for large page numbers as is benchmarked here.
There is an entirely different, much faster way to perform paging in SQL. This is often called the "seek method" as described in this blog post here.
SELECT TOP 10 *
FROM Orders
WHERE OrderDate >= '1980-01-01'
AND ((OrderDate > @previousOrderDate)
OR (OrderDate = @previousOrderDate AND OrderId > @previousOrderId))
ORDER BY OrderDate ASC, OrderId ASC
The @previousOrderDate and @previousOrderId values are the respective values of the last record from the previous page. This allows you to fetch the "next" page. If the ORDER BY direction is DESC, simply use < instead.
With the above method, you cannot immediately jump to page 4 without having first fetched the previous 40 records. But often, you do not want to jump that far anyway. Instead, you get a much faster query that might be able to fetch data in constant time, depending on your indexing. Plus, your pages remain "stable", no matter if the underlying data changes (e.g. on page 1, while you're on page 4).
This is the best way to implement paging when lazy loading more data in web applications, for instance.
Note, the "seek method" is also called keyset paging.
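As a rough, runnable illustration of the seek method, here is a sketch that pages through a small table with SQLite via Python's sqlite3 module. SQLite and LIMIT are stand-ins for SQL Server and TOP; the table layout, index name and the fetch_next_page helper are assumptions made for the example.

```python
import sqlite3


def fetch_next_page(conn, page_size, last_key=None):
    """Return the next page ordered by (order_date, order_id).

    last_key is the (order_date, order_id) of the last row already seen,
    or None for the first page.
    """
    if last_key is None:
        sql = (
            "SELECT order_date, order_id FROM orders "
            "ORDER BY order_date, order_id LIMIT ?"
        )
        params = (page_size,)
    else:
        sql = (
            "SELECT order_date, order_id FROM orders "
            "WHERE order_date > ? OR (order_date = ? AND order_id > ?) "
            "ORDER BY order_date, order_id LIMIT ?"
        )
        params = (last_key[0], last_key[0], last_key[1], page_size)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT NOT NULL)"
)
conn.execute("CREATE INDEX ix_orders_date_id ON orders (order_date, order_id)")
# Several orders share each date, so the date alone is not a unique key.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"2020-01-{1 + i % 7:02d}") for i in range(1, 36)],
)

pages, last_key = [], None
while True:
    page = fetch_next_page(conn, 10, last_key)
    if not page:
        break
    pages.append(page)
    last_key = page[-1]  # remember where this page ended

print([len(p) for p in pages])
```

The tie-breaker on order_id is what keeps rows with equal dates from being skipped or repeated at page boundaries.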

declare @pageOffset int   -- assumed to be a zero-based page index
declare @pageSize int
-- set variables at some point
declare @startRow int
set @startRow = @pageOffset * @pageSize + 1   -- ROW_NUMBER() starts at 1
declare @endRow int
set @endRow = @startRow + @pageSize - 1
SELECT
o.*
FROM
(
SELECT
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
, OrderId
FROM
Orders
WHERE
OrderDate >= '1980-01-01'
) q1
INNER JOIN Orders o
on q1.OrderId = o.OrderId
where
q1.RowNum between @startRow and @endRow
order by
o.OrderDate

@peru, regarding whether there is a better way, and to build on the explanation provided by @a1ex07, try the following.
If the table has a unique identifier, such as a numeric (order-id) or a pair (order-date, order-index), that supports greater-than/less-than comparison, use that as the offset instead of the row number.
For example, if the table orders has 'order_id' as its primary key:
1. To get the first ten results:
select RowNum, order_id from
( select
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
o.order_id
from orders o where o.order_id > 0
)
tmp_qry where RowNum between 1 and 10 order by RowNum; -- first 10
2. Assuming that the last order-id returned was 17, to select the next 10:
select RowNum, order_id from
( select
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
o.order_id
from orders o where o.order_id > 17
)
tmp_qry where RowNum between 1 and 10 order by RowNum; -- next 10
Note that the row-number range has not changed. It is the order-id value being compared that has changed.
This assumes that order_id increases in the same order as OrderDate; otherwise filtering on order_id while sorting on OrderDate would skip or repeat rows.
If such a key is not present, consider adding one!

The main drawback of your query is that it sorts the whole table and calculates ROW_NUMBER for every query. You can make life easier for SQL Server by using fewer columns at the sorting stage (for example, as suggested by Anup Shah). However, you still make it read, sort and calculate row numbers for every query.
An alternative to calculating on the fly is reading values that were calculated beforehand.
Depending on the volatility of your dataset and the number of columns used for sorting and filtering, you can consider:
Adding a row-number column (or 2-3 columns) and including it as the first column of the clustered index, or creating a non-clustered index on it.
Creating views for the most frequent combinations and then indexing those views. These are called indexed (materialised) views.
This allows the row number to be read directly, so performance will hardly depend on volume. Maintaining these structures does depend on volume, but costs less than sorting the whole table for each query.
Note that if this is a one-off query, or is run infrequently compared to all other queries, it is better to stick with query optimisation only: the effort of creating extra columns/views might not pay off.

Related

Is there any better option to apply pagination without applying OFFSET in SQL Server?

I want to apply pagination on a table with a huge amount of data. Is there a better option than using OFFSET in SQL Server?
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique ID of page you are on and skip to the next. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (@numRows)
*
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (@numRows)
*
FROM TableName
WHERE Id < @lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering, in a typical B-Tree+ index, the row with the indicated ID is not read, it's the row after it that's read.
The key chosen must be unique, so if you are paging by a non-unique column then you must add a second column to both ORDER BY and WHERE. You would need an index on OtherColumn, Id for example, to support this type of query. Don't forget INCLUDE columns on the index.
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (@lastOther, @lastId) (this is however supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (@numRows)
*
FROM TableName
WHERE (
(OtherColumn = @lastOther AND Id < @lastId)
OR OtherColumn < @lastOther
)
ORDER BY
OtherColumn DESC,
Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
The presence of NULLs complicates things further. You may want to query those rows separately.
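To check that the expanded OR predicate selects the same rows as a row-value comparison, the two forms can be compared in an engine that supports both. The sketch below uses SQLite through Python's sqlite3 module as a stand-in (LIMIT instead of TOP); the table t, its columns and the helper names are invented for the example, and it says nothing about how SQL Server plans either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, other INTEGER NOT NULL)")
conn.execute("CREATE INDEX ix_t_other_id ON t (other, id)")
# 'other' repeats every five rows, so it is not unique on its own.
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i % 5) for i in range(1, 51)])

ROW_VALUE = """
    SELECT other, id FROM t
    WHERE (other, id) < (?, ?)
    ORDER BY other DESC, id DESC
    LIMIT ?
"""
EXPANDED = """
    SELECT other, id FROM t
    WHERE (other = ? AND id < ?) OR other < ?
    ORDER BY other DESC, id DESC
    LIMIT ?
"""


def next_page_row_value(last_other, last_id, n):
    """Next page using a row-value comparison (not available in SQL Server)."""
    return conn.execute(ROW_VALUE, (last_other, last_id, n)).fetchall()


def next_page_expanded(last_other, last_id, n):
    """Next page using the expanded predicate shown in the answer."""
    return conn.execute(EXPANDED, (last_other, last_id, last_other, n)).fetchall()


print(next_page_expanded(3, 28, 7))
```

Both helpers take the key of the last row already shown and return the rows that follow it in descending (other, id) order.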
On a very big merchant website we use a technique that consists of storing ids in a pseudo-temporary table and joining that table to the rows of the product table.
Let me illustrate with a concrete example.
We have a table designed this way:
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
PGN_SESSION_DATE DATETIME2(0) NOT NULL,
PGN_PRODUCT_ID INT NOT NULL,
PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
INCLUDE (PGN_PRODUCT_ID); -- a key column cannot be repeated in INCLUDE; the product id is presumably what was meant
CREATE INDEX X_PGN_SESSION_DATE
ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table called T_PRODUIT_PRD, which a customer filters with many predicates. We INSERT the rows of the filtered SELECT into this table as follows:
DECLARE @SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
SELECT @SESSION_ID , SYSUTCDATETIME(), PRD_ID,
ROW_NUMBER() OVER(ORDER BY --> custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ... --> custom filter
Then, every time we need a given page made up of @N products, we add a join to this table:
...
JOIN S_TEMP.T_PAGINATION_PGN
ON PGN_SESSION_GUID = @SESSION_ID
AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER -- ROW_NUMBER() starts at 1
AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
All the indexes will do the job!
Of course, we have to purge this table regularly, which is why a scheduled job deletes the rows whose sessions were generated more than 4 hours ago:
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
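The page arithmetic is easy to get wrong by one, so here is a small sketch of one way to map between row numbers and page numbers, assuming PGN_SESSION_ORDER comes from ROW_NUMBER() and therefore starts at 1. The function names are hypothetical. Expressing the page as an inclusive range of row numbers also makes it possible to filter with BETWEEN on the stored column instead of computing an expression per row.

```python
def page_of(session_order, page_size):
    """Page number (1-based) that a 1-based row number falls on."""
    return 1 + (session_order - 1) // page_size


def order_range(page_number, page_size):
    """Inclusive range of 1-based row numbers that make up a page."""
    first = (page_number - 1) * page_size + 1
    return first, first + page_size - 1


first, last = order_range(3, 10)
print(first, last, [page_of(o, 10) for o in (first, last, last + 1)])
```

With a page size of 10, rows 1 to 10 land on page 1 and row 11 starts page 2, which is the behaviour a user would expect from the join condition.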
In the same spirit as SQLPro's solution, I propose:
WITH CTE AS
(SELECT 30000000 AS N
UNION ALL SELECT N-1 FROM CTE
WHERE N > 30000000 +1 - 20)
SELECT T.* FROM CTE JOIN TableName T ON CTE.N=T.ID
ORDER BY CTE.N DESC
Tried with 2 billion rows and it's instant!
Easy to turn into a stored procedure...
Of course, this is only valid if the ids are consecutive (no gaps).
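The caveat about consecutive ids can be seen in a small experiment. This sketch runs the same recursive-CTE idea in SQLite through Python's sqlite3 module (SQLite needs the RECURSIVE keyword); the table t and the helper name are made up for the example.

```python
import sqlite3


def page_by_id_range(conn, top_id, page_size):
    """Generate page_size consecutive ids ending at top_id and join them to the table."""
    sql = """
        WITH RECURSIVE cte(n) AS (
            SELECT ?
            UNION ALL
            SELECT n - 1 FROM cte WHERE n > ? + 1 - ?
        )
        SELECT t.id
        FROM cte JOIN t ON t.id = cte.n
        ORDER BY cte.n DESC
    """
    return [row[0] for row in conn.execute(sql, (top_id, top_id, page_size))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 101)])
conn.execute("DELETE FROM t WHERE id = 95")  # introduce a gap in the ids

page = page_by_id_range(conn, 100, 20)
print(len(page), page[:7])
```

Because id 95 is missing, the "page of 20" contains only 19 rows: the CTE generates candidate ids, not row positions.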

Pagination in two tables in SQL Server

I have two tables:
Orders - this one is small, typically up to 50 thousand records
OrdersArchive - this one is normal, about 80 million records
The following situation may occur:
an order may have one of these values for status:
'created'
'processing'
'finished'
The finished orders from Orders are periodically moved to OrdersArchive.
In other words, Orders might contain orders with status created, processing or finished. OrdersArchive contains only orders with a status of finished.
The result has to be sorted in this order: 'created', 'processing', 'finished'.
I need a query over these two tables which supports pagination.
What is the best way to do it (as fast as possible)?
The pagination may be of either type, i.e.:
the classical pagination with PageNumber and CountOfRowsPerPage.
'lazy' pagination with a count of orders after a specific order.
I would use the union SQL operator for this. See the w3schools page for details.
With union you can either do union or union all. The first will check for duplicates while the second just combines the results. It sounds like you shouldn't have duplicates in these two tables so for performance you don't need to do the distinct search.
You also need to make sure that both queries have the same number of columns with similar types.
e.g.
select orderno, status from Orders
union all
select orderno, status from OrdersArchive
order by status, orderno
Pagination
That query gives you the combined resultset for both tables. Now to add pagination I would use a CTE with row numbers like this:
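with x as (
select orderno as num, status as stat from Orders
union all
select archiveorderno as num, archivestatus as stat from OrdersArchive
), numbered as (
-- a window function cannot be filtered in the same SELECT, so number the rows first
-- note: ORDER BY stat sorts alphabetically; use a CASE expression if 'processing' must come before 'finished'
select row_number() over(order by stat, num) as rownum, num, stat from x
)
select rownum, num, stat from numbered
where rownum between 1 and 20
order by rownum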
Alternative
If you find using union is too slow then you could look at changing the way your search works. If you always sort the same way and it's always records from Orders followed by records from OrdersArchive then you could query the tables separately. Start by paging through Orders and then when you run out of records continue paging through OrdersArchive. This would be much faster than the union but you would have to keep the query simple and always sort on status. The union allows much more complex searches.
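Paging through one table and then continuing into the other comes down to a bit of offset arithmetic. The sketch below shows one way to do it, assuming that every row of Orders sorts before every row of OrdersArchive and that the current row count of Orders is known; split_page and its parameters are hypothetical names.

```python
def split_page(page_number, page_size, first_count):
    """Split one page across two tables read back to back.

    Returns ((offset, limit) for Orders, (offset, limit) for OrdersArchive).
    page_number is 1-based; first_count is the number of rows in Orders.
    """
    start = (page_number - 1) * page_size  # 0-based offset into the combined list
    from_first = max(0, min(page_size, first_count - start))
    archive_offset = max(0, start - first_count)
    from_archive = page_size - from_first
    return (min(start, first_count), from_first), (archive_offset, from_archive)


for page_number in (1, 3, 4):
    print(page_number, split_page(page_number, 10, 25))
```

A limit of 0 means that table can be skipped for the page, so only the page that straddles the boundary needs two queries.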
In SQL Server 2012 and later, OFFSET and FETCH NEXT can provide a paging solution. Rough sample code:
DECLARE @PageNumber INT = 2;
DECLARE @PageSize INT = 100000;
SELECT [ID]
FROM [Table]
ORDER BY [ID]
OFFSET @PageSize * (@PageNumber - 1) ROWS
FETCH NEXT @PageSize ROWS ONLY;
Obviously, substitute your own tables, filters and ordering, and probably put this in a stored procedure with PageNumber and PageSize as input parameters.

SQL stored proc runs extremely slow when filtering by high row numbers

This query is generated by a very long dynamic SQL stored procedure. The procedure returns the requested number of records starting at a given index, to be displayed in a Telerik RadGrid, effectively handling paging. A simplified version of the stored proc's output:
SELECT r.* FROM (
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row,
v.* FROM vInventorySearch v
) as R WHERE [ROW] BETWEEN 1 AND 10
When the "BETWEEN" clause is between 1 and 10, it runs in a fraction of a second, but if it's between something like 10000 and 10010 it takes almost a full minute to execute.
I feel like I may be missing something fundamental here, but it seems to me that it shouldn't matter which 10 records I'm retrieving, it should take the same amount of time.
Thanks for any input, I'm looking forward to being embarrassed!
Solution, courtesy Martin Smith (below) :
SELECT r.*, inv.* FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row, v.InventoryID
FROM vInventorySearch v
WHERE 1=1
) as R
inner join vInventory inv on r.InventoryID = inv.InventoryID
WHERE [ROW] BETWEEN 10001 AND 10010
Thanks for your help!
Paginating by ROW_NUMBER can indeed be pretty inefficient for higher row numbers.
Sometimes it is better to break it up a bit and have the ROW_NUMBER query on a narrow index to retrieve the matching PKs with a join back onto the base table to retrieve the missing columns.
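The shape of that approach (page over the keys only, then join back for the wide columns) can be sketched as follows. SQLite through Python's sqlite3 module is used purely so the example runs anywhere, with LIMIT/OFFSET standing in for the ROW_NUMBER filter; the inventory table and fetch_page are invented for illustration, and the sketch shows the query shape rather than any timing difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, name TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [(i, f"part-{i}", "x" * 200) for i in range(1, 501)],
)


def fetch_page(first_row, page_size):
    """Select the page's keys from the narrow column, then join back for the rest."""
    sql = """
        SELECT inv.inventory_id, inv.name, inv.notes
        FROM (
            SELECT inventory_id
            FROM inventory
            ORDER BY inventory_id DESC
            LIMIT ? OFFSET ?
        ) AS k
        JOIN inventory AS inv ON inv.inventory_id = k.inventory_id
        ORDER BY inv.inventory_id DESC
    """
    return conn.execute(sql, (page_size, first_row - 1)).fetchall()


rows = fetch_page(101, 10)
print([r[0] for r in rows])
```

Only the ten selected keys are joined to the wide rows; the skipped rows are read from the key column alone.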
SQL Server 2012 has a more efficient paging mechanism:
http://stevestedman.com/2012/04/tsql-2012-offset-and-fetch/
SELECT DepartmentID, Revenue, Year
FROM REVENUE
ORDER BY Year, DepartmentID ASC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

SQL Server 2008 Paged Row Retrieval and Large Tables

I'm using SQL Server 2008 and the following query to implement paged data retrieval from our JSF application. In the code below I am retrieving 25 rows at a time, sorted by the default sort column in DESC order.
SELECT * FROM
(
SELECT TOP 25 * FROM
(
SELECT TOP 25 ...... WHERE CONDITIONS
--ORDER BY ... DESC
)AS INNERQUERY ORDER BY INNERQUERY.... ASC
)
AS OUTERQUERY
ORDER BY OUTERQUERY.... DESC
It works, but with one obvious flaw. If the user requests the last page and there are over 10 million records in the table, the inner TOP query first has to retrieve the 10 million records, and only then can the outer TOP query pick out the top 25. It will look like this:
SELECT * FROM
(
SELECT TOP 25 * FROM
(
SELECT TOP 10000000 ...... WHERE CONDITIONS
--ORDER BY ... DESC
)AS INNERQUERY ORDER BY INNERQUERY.... ASC
)
AS OUTERQUERY
ORDER BY OUTERQUERY.... DESC
I looked into replacing the above with ROW_NUMBER() OVER(....), but it seemingly has the same issue: the inner query has to produce the entire result before you can filter on ROW_NUMBER between x and y.
Can you please point out my mistakes in the above approach and give hints on how it can be optimized?
I'm currently using the following code to retrieve a subset of rows:
WITH PAGED_QRY AS (
SELECT *, ROW_NUMBER() OVER(ORDER BY Y) AS ROW_NO
FROM TABLE WHERE ....
)
SELECT * FROM PAGED_QRY WHERE ROW_NO BETWEEN @CURRENT_INDEX and @ROWS_TO_RETRIEVE
ORDER BY ROW_NO
where @current_index and @rows_to_retrieve (i.e. 1 and 50) are your paging variables. It's cleaner and easier to read.
I've also tried using SET ROWCOUNT @ROWS_TO_RETRIEVE, but it doesn't seem to make much difference.
Using the above query, and by carefully studying its execution plan and modifying/creating indexes and statistics, I've reached results that are sufficiently satisfactory, hence why I'm marking this as the answer. The original goal of retrieving only the required rows in the inner query does not seem to be possible yet; if you do find a way, please let me know.
We can improve the above query a bit more.
If I assume that @current_index is the current page number, we can rewrite the query as:
WITH PAGED_QRY AS (
SELECT TOP (@current_index * @rows_to_retrieve) *, ROW_NUMBER()
OVER(ORDER BY Y) AS ROW_NO
FROM TABLE WHERE ....
ORDER BY Y
)
SELECT TOP (@rows_to_retrieve) * FROM PAGED_QRY
ORDER BY ROW_NO DESC
In this case, the inner query does not return the whole record set. Suppose our page_index is 3 and page_size is 50; then it selects only 150 rows (even if the table contains hundreds, thousands or millions of rows), and we can skip the WHERE clause as well.

Getting the Nth most recent business date - very different query performance using two different methods

I have a requirement to get txns on a T-5 basis. Meaning I need to "go back" 5 business days.
I've coded up two SQL queries for this and the second method is 5 times slower than the first.
How come?
-- Fast
with
BizDays as
( select top 5 bdate bdate
from dbo.business_days
where bdate < '20091211'
order by bdate Desc
)
,BizDate as ( select min(bdate) bdate from BizDays)
select t.* from txns t
join BizDate on t.bdate <= BizDate.bdate
-- Slow
with
BizDays as
( select dense_rank() Over(order by bdate Desc) RN
, bdate
from dbo.business_days
where bdate < '20091211'
)
,BizDate as ( select bdate from BizDays where RN = 5)
select t.* from txns t
join BizDate on t.bdate <= BizDate.bdate
DENSE_RANK does not stop after the first 5 records like TOP 5 does.
Though DENSE_RANK is monotonic and hence theoretically could be optimized to TOP WITH TIES, SQL Server's optimizer is not aware of that and does not do this optimization.
If your business days are unique, you can replace DENSE_RANK with ROW_NUMBER and get the same performance, since ROW_NUMBER is optimized to a TOP.
Instead of putting the conditions in WHERE and JOIN clauses, could you perhaps use ORDER BY on your meeting data and then LIMIT offset, rowcount?
The reason this is running so slow is that DENSE_RANK() and ROW_NUMBER() are functions. The engine has to read every record in the table that matches the WHERE clause, apply the function to each row, save the function value, and then get the top 5 from that list.
A "plain" top 5 uses the index on the table to get the first 5 records that meet the WHERE clause. In the best case, the engine may only have to read a couple of index pages. Worst case, it may have to read a few data pages as well. Even without an index, the engine is reading the rows but does not have to execute the function or work with temporary tables.