Best way to set up an index for this particular query? - sql

I have a DB with about 2 million rows and I need to fix my current paging and have decided to go with the following:
SET @startRowIndex = ((@Page - 1) * @PageSize) + 1;
SET ROWCOUNT @startRowIndex
SELECT @first_id = ProductID FROM LiveProducts (nolock) WHERE ManufacturerID = @ManufacturerID AND ModifiedOn >= @tStamp ORDER BY ProductID
SET ROWCOUNT @PageSize
SELECT * FROM LiveProducts (nolock) WHERE ManufacturerID = @ManufacturerID AND ProductID >= @first_id ORDER BY ProductID
I am nowhere near a DBA and I want this to be as fast as possible. What index(es) should I set on this thing? From my reading and my basic understanding, I gathered that I should create a non-clustered index on ManufacturerID, ProductID, and ModifiedOn.
But should they all be index key columns, or just one as a key column and the others as included columns?

The first query uses the following columns: ProductId, ManufacturerId, and ModifiedOn.
Because you have an inequality on the date, the index can be used to optimize the WHERE clause but not the ORDER BY. However, by including ProductId in the index, the engine can satisfy the entire query using the following index: LiveProducts(ManufacturerId, ModifiedOn, ProductId). Note that the ordering of these columns is important. And the query will still need to do a sort for the ORDER BY.
The second query selects all columns, so it needs to go to the original data. So the optimization is on the WHERE clause only. For this, use LiveProducts(ManufacturerId, ProductId). In this case, it should be able to use the index for the sort.
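As DDL, those two suggestions might look like the sketch below (the index names are my own):
CREATE NONCLUSTERED INDEX IX_LiveProducts_Manuf_Modified
    ON LiveProducts (ManufacturerID, ModifiedOn, ProductID); -- covers the first query
CREATE NONCLUSTERED INDEX IX_LiveProducts_Manuf_Product
    ON LiveProducts (ManufacturerID, ProductID); -- supports the WHERE and the ORDER BY of the second query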

Related

Is there any better option to apply pagination without applying OFFSET in SQL Server?

I want to apply pagination on a table with huge data. All I want is a better option than using OFFSET in SQL Server.
Here is my simple query:
SELECT *
FROM TableName
ORDER BY Id DESC
OFFSET 30000000 ROWS
FETCH NEXT 20 ROWS ONLY
You can use Keyset Pagination for this. It's far more efficient than using Rowset Pagination (paging by row number).
In Rowset Pagination, all previous rows must be read, before being able to read the next page. Whereas in Keyset Pagination, the server can jump immediately to the correct place in the index, so no extra rows are read that do not need to be.
For this to perform well, you need to have a unique index on that key, which includes any other columns you need to query.
In this type of pagination, you cannot jump to a specific page number. You jump to a specific key and read from there. So you need to save the unique ID of the page you are on and skip to the next one. Alternatively, you could calculate or estimate a starting point for each page up-front.
One big benefit, apart from the obvious efficiency gain, is avoiding the "missing row" problem when paginating, caused by rows being removed from previously read pages. This does not happen when paginating by key, because the key does not change.
Here is an example:
Let us assume you have a table called TableName with an index on Id, and you want to start at the latest Id value and work backwards.
You begin with:
SELECT TOP (@numRows)
*
FROM TableName
ORDER BY Id DESC;
Note the use of ORDER BY to ensure the order is correct
In some RDBMSs you need LIMIT instead of TOP
The client will hold the last received Id value (the lowest in this case). On the next request, you jump to that key and carry on:
SELECT TOP (@numRows)
*
FROM TableName
WHERE Id < @lastId
ORDER BY Id DESC;
Note the use of < not <=
In case you were wondering: in a typical B+Tree index, the row with the indicated ID is not read; it's the row after it that's read.
The key chosen must be unique, so if you are paging by a non-unique column then you must add a second column to both ORDER BY and WHERE. You would need an index on OtherColumn, Id for example, to support this type of query. Don't forget INCLUDE columns on the index.
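As a sketch, a supporting index for that two-column case might look like this (the names are my own; add whatever INCLUDE columns your SELECT needs):
CREATE UNIQUE INDEX IX_TableName_OtherColumn_Id
    ON TableName (OtherColumn, Id)
    INCLUDE (SomeDisplayColumn); -- hypothetical column, so the index covers the query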
SQL Server does not support row/tuple comparators, so you cannot do (OtherColumn, Id) < (@lastOther, @lastId) (this is, however, supported in PostgreSQL, MySQL, MariaDB and SQLite).
Instead you need the following:
SELECT TOP (@numRows)
*
FROM TableName
WHERE (
    (OtherColumn = @lastOther AND Id < @lastId)
    OR OtherColumn < @lastOther
)
ORDER BY
OtherColumn DESC,
Id DESC;
This is more efficient than it looks, as SQL Server can convert this into a proper < over both values.
The presence of NULLs complicates things further. You may want to query those rows separately.
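As a rough illustration of querying them separately (assuming NULLs should come after all non-NULL values), once the non-NULL pages are exhausted you could switch to:
SELECT TOP (@numRows)
    *
FROM TableName
WHERE OtherColumn IS NULL
  AND Id < @lastId
ORDER BY Id DESC;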
On a very big merchant website we use a technique built around IDs stored in a pseudo-temporary table, which is then joined to the rows of the product table.
Let me explain with a clear example.
We have a table designed this way:
CREATE TABLE S_TEMP.T_PAGINATION_PGN
(PGN_ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
 PGN_SESSION_GUID UNIQUEIDENTIFIER NOT NULL,
 PGN_SESSION_DATE DATETIME2(0) NOT NULL,
 PGN_PRODUCT_ID INT NOT NULL,
 PGN_SESSION_ORDER INT NOT NULL);
CREATE INDEX X_PGN_SESSION_GUID_ORDER
   ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_GUID, PGN_SESSION_ORDER)
   INCLUDE (PGN_PRODUCT_ID); -- include the product id so the page lookup is covered
CREATE INDEX X_PGN_SESSION_DATE
   ON S_TEMP.T_PAGINATION_PGN (PGN_SESSION_DATE);
We have a very big product table called T_PRODUIT_PRD, and a customer filters it with many predicates. We INSERT rows from the filtered SELECT into this table this way:
DECLARE @SESSION_ID UNIQUEIDENTIFIER = NEWID();
INSERT INTO S_TEMP.T_PAGINATION_PGN
       (PGN_SESSION_GUID, PGN_SESSION_DATE, PGN_PRODUCT_ID, PGN_SESSION_ORDER)
SELECT @SESSION_ID, SYSUTCDATETIME(), PRD_ID,
       ROW_NUMBER() OVER (ORDER BY ...) --> custom order by
FROM dbo.T_PRODUIT_PRD
WHERE ... --> custom filter
Then every time we need a desired page of @N products, we add a join to this table like this:
...
JOIN S_TEMP.T_PAGINATION_PGN
     ON PGN_SESSION_GUID = @SESSION_ID
    AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER -- ROW_NUMBER starts at 1, hence the -1
    AND PGN_PRODUCT_ID = dbo.T_PRODUIT_PRD.PRD_ID
All the indexes will do the job!
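Put together, fetching one page might look like this sketch (the page size and page number values are illustrative):
DECLARE @N INT = 25, @DESIRED_PAGE_NUMBER INT = 3;
SELECT PRD.*
FROM dbo.T_PRODUIT_PRD AS PRD
JOIN S_TEMP.T_PAGINATION_PGN
     ON PGN_SESSION_GUID = @SESSION_ID
    AND 1 + ((PGN_SESSION_ORDER - 1) / @N) = @DESIRED_PAGE_NUMBER
    AND PGN_PRODUCT_ID = PRD.PRD_ID
ORDER BY PGN_SESSION_ORDER;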
Of course, we regularly have to purge this table, and this is why we have a scheduled job that deletes the rows whose sessions were generated more than 4 hours ago:
DELETE FROM S_TEMP.T_PAGINATION_PGN
WHERE PGN_SESSION_DATE < DATEADD(hour, -4, SYSUTCDATETIME());
In the same spirit as SQLPro's solution, I propose:
WITH CTE AS
(SELECT 30000000 AS N
 UNION ALL
 SELECT N - 1 FROM CTE
 WHERE N > 30000000 + 1 - 20)
SELECT T.*
FROM CTE
JOIN TableName T ON CTE.N = T.ID
ORDER BY CTE.N DESC;
Tried with 2 billion rows and it's instant!
Easy to make it a stored procedure...
Of course, this is only valid if the IDs are contiguous.

SQL tuning, long running query + rownum

I have a million records in a database table with account number, address, and many more columns. I want 100 rows sorted in descending order; I used ROWNUM for this, but the query takes a long time to execute, since it scans the full table first, sorts it, and then applies the ROWNUM filter.
What is the solution to minimize the query execution time?
For example:
select *
from
(select
acc_no, address
from
customer
order by
acc_no desc)
where
ROWNUM <= 100;
From past experience I have found that TOP works best for this scenario.
Also, you should always select only the columns you need and avoid using the wildcard (*).
SELECT TOP 100 [acc_no], [address] FROM [customer] ORDER BY [acc_no] DESC
Useful resources about TOP, LIMIT and even ROWNUM.
https://www.w3schools.com/sql/sql_top.asp
Make sure you have an index on the acc_no column.
If you already have an index on acc_no, check whether it is actually used during query execution by verifying the query execution plan.
To create a new index if one is not present, use the query below:
Create index idx1 on customer(acc_no); -- If acc_no is not unique
Create unique index idx1 on customer(acc_no); -- If acc_no is unique. Note: a unique index is faster.
If the explain plan output shows "Full table scan", then the optimizer is not using the index.
Try with a hint first (the hint goes in the query block that references customer):
select * from
    (select /*+ INDEX(customer idx1) */
            acc_no, address
     from customer
     order by acc_no desc)
where ROWNUM <= 100;
If the query with the hint above returns results quickly, then you need to check why the optimizer is deliberately ignoring your index. One probable reason for this is outdated statistics. Refresh the statistics.
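For example (schema and table names assumed), statistics can be refreshed with the DBMS_STATS package:
-- Gather fresh optimizer statistics for the table
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP_SCHEMA', tabname => 'CUSTOMER');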
Hope this helps.
Consider getting your top account numbers in an inner query / inline view so that you only perform the joins on those 100 customer records. Otherwise, you could be performing all the joins on the million+ rows and then sorting the million+ results to get the top 100. Something like this may work.
select .....
from customer
where customer.acc_no in (select acc_no from
(select inner_cust.acc_no
from customer inner_cust
order by inner_cust.acc_no desc
)
where rownum <= 100)
and ...
Or, if you are using 12c, you can use FETCH FIRST 100 ROWS ONLY:
select .....
from customer
where customer.acc_no in (select inner_cust.acc_no
from customer inner_cust
order by inner_cust.acc_no desc
fetch first 100 rows only
)
and ...
This will give the result within 100 ms, but MAKE SURE there is an index on column ACC_NO. There can also be a combined index on ACC_NO plus other columns, but ACC_NO MUST be in the first position in the index. You have to see "range scan" in the execution plan, not "full table scan" and not "skip scan". You will probably also see nested loops in the execution plan (these fetch the ADDRESSes from the table). You can improve speed even more by creating a combined index on ACC_NO, ADDRESS (in this order). In that case the Oracle engine does not have to read the table at all, because all the information is contained in the index. You can compare the two in the execution plan.
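That combined (covering) index might look like this sketch (the index name is my own):
-- With acc_no and address both in the index, Oracle can answer from the index alone
create index idx_cust_accno_addr on customer (acc_no, address);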
select top 100 acc_no, address
from customer
order by acc_no desc

Pagination in SQL - Performance issue

I am trying to use pagination, and I found the perfect link on SO:
https://stackoverflow.com/a/109290/1481690
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
) AS RowConstrainedResult
WHERE RowNum >= 1
AND RowNum < 20
ORDER BY RowNum
I am trying to use the exact same query, with an additional join of a few tables in my inner query.
I am getting performance issues in the following scenarios:
WHERE RowNum >= 1
AND RowNum < 20        ==> executes fast, approx 2 sec
WHERE RowNum >= 1000
AND RowNum < 1010      ==> more time, approx 10 sec
WHERE RowNum >= 30000
AND RowNum < 30010     ==> more time, approx 17 sec
Every time I select just 10 rows, but there is a huge time difference. Any ideas or suggestions?
I chose this approach because I am binding columns dynamically and forming the query. Is there any better way I can organize the pagination query in SQL Server 2008?
Is there a way I can improve the performance of the query?
Thanks
I always check how much data I am accessing in a query and try to eliminate unnecessary columns as well as rows.
Well, these are just obvious points you might have already checked, but I wanted to point them out in case you haven't already.
In your query the slow performance might be because you are doing SELECT *. Selecting all columns from a table does not allow the optimizer to come up with a good execution plan.
Check if you need only selected columns, and make sure you have a correct covering index on table Orders.
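As an illustration, a covering index for this paging query might look like the sketch below (the INCLUDE list is an assumption; use the columns your outer SELECT actually needs):
CREATE INDEX IX_Orders_OrderDate
    ON Orders (OrderDate)
    INCLUDE (OrderId, CustomerId); -- hypothetical column list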
Because an explicit SKIP or OFFSET function is not available in SQL Server 2008, we need to build one, and we can do that with an INNER JOIN.
In the first query we generate only the row number along with OrderDate and nothing else.
We do the same in the second query, but there we also select the other columns of interest from table Orders, or all of them if you need every column.
Then we JOIN the two result sets by RowNum and OrderDate, and add the skip-rows filter to the first query, where the data set is at its minimal size.
Try this code.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, OrderDate
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q2
ON q1.RowNum = q2.RowNum AND q1.OrderDate = q2.OrderDate AND q1.RowNum BETWEEN 30000 AND 30020
To give you an estimate, I tried this with the following test data, and no matter what window you query, the results come back in less than 2 seconds. Note that the table is a HEAP (no index) and has 2M rows in total; the test select queries 10 rows, from 50,000 to 50,010.
The insert below took around 8 minutes.
IF OBJECT_ID('TestSelect','u') IS NOT NULL
    DROP TABLE TestSelect
GO
CREATE TABLE TestSelect
(
    OrderDate DATETIME2(2)
)
GO
DECLARE @i BIGINT = 1, @dt DATETIME2(2) = '01/01/1700'
WHILE @i <= 2000000
BEGIN
    IF @i % 15 = 0
        SELECT @dt = DATEADD(DAY, 1, @dt)
    INSERT INTO dbo.TestSelect (OrderDate)
    SELECT @dt
    SELECT @i = @i + 1
END
Selecting the window 50,000 to 50,010 took less than 3 seconds.
Selecting the last single row 2,000,000 to 2,000,000 also took 3 seconds.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,OrderDate
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,*
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q2
ON q1.RowNum=q2.RowNum
AND q1.OrderDate=q2.OrderDate
AND q1.RowNum BETWEEN 50000 AND 50010
ROW_NUMBER is a crappy way of doing pagination, as the cost of the operation grows extensively.
Instead, you should use a double ORDER BY clause.
Say you want to get records with ROW_NUMBER between 1200 and 1210. Instead of using ROW_NUMBER() OVER (...) and later filtering the result in WHERE, you should rather:
SELECT TOP (11) *
FROM (
    SELECT TOP (1210) *
    FROM [...]
    ORDER BY something ASC
) subQuery
ORDER BY something DESC;
Note that this query will give the result in reverse order. That shouldn't, generally speaking, be an issue, as it's easy to reverse the set in the UI, i.e. in C#, especially as the resulting set should be relatively small.
The latter is generally a lot faster. Note that this solution is greatly improved by clustering (CREATE CLUSTERED INDEX ...) on the column you use to sort the query by.
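For example (table and column names assumed):
-- Rows are then stored in sort order, so both TOP scans become cheap range reads
CREATE CLUSTERED INDEX IX_TableName_Something ON TableName (something);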
Hope that helps.
Even though you are always selecting the same number of rows, performance degrades when you want to select rows at the end of your data window. To get the first 10 rows, the engine fetches just 10 rows; to get the next 10, it has to fetch 20, discard the first 10, and return 10. To get rows 30000 to 30010, it has to read all 30010, skip the first 30K, and return 10.
Some tricks to improve performance (not a full list; building OLAP is completely skipped):
You mentioned joins; if possible, join not inside the inner query but to the result of it. You can also try to add some logic to ORDER BY OrderDate: ASC or DESC depending on which bucket you are retrieving. Say, if you want to grab the "last" 10, ORDER BY ... DESC will work much faster. Needless to say, there has to be an index on OrderDate.
Incredibly, no other answer has mentioned the fastest way to do paging in all SQL Server versions, specifically with respect to the OP's question, where offsets can be terribly slow for large page numbers, as is benchmarked here.
There is an entirely different, much faster way to perform paging in SQL. This is often called the "seek method" as described in this blog post here.
SELECT TOP 10 *
FROM Orders
WHERE OrderDate >= '1980-01-01'
AND ((OrderDate > @previousOrderDate)
  OR (OrderDate = @previousOrderDate AND OrderId > @previousOrderId))
ORDER BY OrderDate ASC, OrderId ASC
The @previousOrderDate and @previousOrderId values are the respective values of the last record from the previous page. This allows you to fetch the "next" page. If the ORDER BY direction is DESC, simply use < instead.
With the above method, you cannot immediately jump to page 4 without having first fetched the previous 40 records. But often, you do not want to jump that far anyway. Instead, you get a much faster query that might be able to fetch data in constant time, depending on your indexing. Plus, your pages remain "stable", no matter if the underlying data changes (e.g. on page 1, while you're on page 4).
This is the best way to implement paging when lazy loading more data in web applications, for instance.
Note, the "seek method" is also called keyset paging.
declare @pageOffset int
declare @pageSize int
-- set variables at some point
declare @startRow int
set @startRow = @pageOffset * @pageSize
declare @endRow int
set @endRow = @startRow + @pageSize - 1
SELECT
o.*
FROM
(
SELECT
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
, OrderId
FROM
Orders
WHERE
OrderDate >= '1980-01-01'
) q1
INNER JOIN Orders o
on q1.OrderId = o.OrderId
where
q1.RowNum between @startRow and @endRow
order by
o.OrderDate
@peru, regarding whether there is a better way, and to build on the explanation provided by @a1ex07, try the following.
If the table has a unique identifier such as a numeric (order-id) or (order-date, order-index) upon which a compare (greater-than, less-than) operation can be performed, then use that as an offset instead of the row number.
For example, if the table orders has order_id as its primary key, then:
To get the first ten results:
1.
select RowNum, order_id from
  ( select
        ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
        o.order_id
    from orders o
    where o.order_id > 0
  ) tmp_qry
where RowNum between 1 and 10
order by RowNum; -- first 10
Assuming that the last order_id returned was 17, then:
To select the next 10:
2.
select RowNum, order_id from
  ( select
        ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
        o.order_id
    from orders o
    where o.order_id > 17
  ) tmp_qry
where RowNum between 1 and 10
order by RowNum; -- next 10
Note that the RowNum values have not changed; it is the order_id value being compared that has changed.
If such a key is not present then consider adding one !
The main drawback of your query is that it sorts the whole table and calculates Row_Number for every query. You can make life easier for SQL Server by using fewer columns at the sorting stage (for example, as suggested by Anup Shah). However, you still make it read, sort, and calculate row numbers for every query.
An alternative to calculating on the fly is reading values that were calculated before.
Depending on the volatility of your dataset and the number of columns for sorting and filtering, you can consider:
Adding a rownumber column (or 2-3 columns) and including it as the first column in the clustered index, or creating a non-clustered index on it.
Creating views for the most frequent combinations and then indexing those views. These are called indexed (materialised) views.
This will allow you to read the rownumber directly, and performance will hardly depend on volume. Maintaining these will cost something, but less than sorting the whole table for each query.
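A hedged sketch of the first option (the column, index, and sort-key names are my own):
-- One-time schema change: a persisted row-number column
ALTER TABLE Orders ADD RowNum BIGINT NULL;
GO
-- Recompute the numbers periodically (e.g. from a scheduled job)
WITH Numbered AS
(
    SELECT RowNum, ROW_NUMBER() OVER (ORDER BY OrderDate, OrderId) AS NewNum
    FROM Orders
)
UPDATE Numbered SET RowNum = NewNum;
CREATE INDEX IX_Orders_RowNum ON Orders (RowNum);
-- A page read is now a cheap range seek instead of a scan-and-sort
SELECT * FROM Orders WHERE RowNum BETWEEN 30000 AND 30010;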
Note that if this is a one-off query that runs infrequently compared to all other queries, it is better to stick with query optimisation only: the effort to create extra columns/views might not pay off.

OrderBy clause returning different result sets when the order column has the same data

We have a stored proc that returns a set of records based on page number and page size. Sorting is done by a column "CreatedDateTime". If the value of CreatedDateTime is the same for all the records, it gives the result sets in different orders. The behavior is inconsistent.
Some Portion of Code:
SET @FirstRec = ( @PageNo - 1 ) * @PageSize
SET @LastRec = ( @PageNo * @PageSize + 1 )
SELECT *
FROM
(
select ROW_NUMBER() OVER (ORDER BY CreatedDateTime)
AS rowNumber,EMPID
From Employee
) as KeyList
WHERE rowNumber > @FirstRec AND rowNumber < @LastRec
Please provide some inputs on this.
This is "by design"
SQL Server (or any RDBMS) does not guarantee that results are returned in a particular order if no ORDER BY clause is specified. Some people think that the rows are always returned in clustered index order or physical disk order if no ORDER BY clause is specified. However, that is incorrect, as there are many factors that can change the row order during query processing. A parallel HASH join is a good example of an operator that changes the row order.
If you specify an ORDER BY clause, SQL Server will sort the rows and return them in the requested order. However, if that order is not deterministic because you have duplicate values, within each "value group" the order is "random" for the same reasons mentioned above.
The only way to guarantee a deterministic order is to include a guaranteed unique column or column group (for example the Primary Key) in the ORDER BY clause.
If you need a reproducible order, then you need to ensure that you specify enough columns in your ORDER BY, such that (the combination of all columns listed in the ORDER BY) is unique for every row. E.g. add EmpID (if that's a primary key) to act as a "tie-breaker" between rows with equal CreatedDateTime values.
If the values in the column you ORDER BY are all the same, then there is no guarantee that they will be retrieved in the same order. You can ORDER BY a second column - perhaps the unique id if there is one? (I have called it UniqueId in the code below). This would ensure the order is always the same.
SELECT *
FROM
(
select ROW_NUMBER() OVER (ORDER BY CreatedDateTime, UniqueId)
AS rowNumber,EMPID
From Employee
) as KeyList
WHERE rowNumber > @FirstRec AND rowNumber < @LastRec

What is the best solution to retrieve large recordsets - more than 3000 rows

I have a SQL Server table with 3000 rows in it. When I retrieve those rows with a SELECT statement, it takes time. What is the best solution to retrieve them?
It is essential to post your SQL query here for this question, but assuming a simple select statement, my answers would be:
1) First select only the limited number of columns that are required. Don't use SELECT *; use specific columns if all columns are not required in your desired output.
2) If your select statement has a filter, then arrange the filter so that it does the minimum number of operations and gets the optimum result (if you post SQL statements then I can surely help with this).
3) Create an index on the specific field being filtered; that will also help to improve your query performance, for example the sketch below.
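A minimal example, with a hypothetical filter column:
-- Assuming the query filters on a Status column (hypothetical name)
CREATE INDEX IX_MyTable_Status ON MyTable (Status);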
Hope this helps
Since you don't want to show all 3000 records at one time, use paging in your SQL statement. Here is an example using the AdventureWorks database in SQL Server. Assuming each of your web pages shows 25 records, this statement gets all the records required for the 5th page. "QueryResults" is a Common Table Expression (CTE), and I only select the primary keys to keep the CTE small in case you have millions of records. Afterwards, I join QueryResults (the CTE) to the main table (Product) and select any columns I need. @PageNumber below is the current page number. Perform your "WHERE" and sort statements within the CTE.
DECLARE @PageNumber int, @PageSize int;
SET @PageSize = 25;
SET @PageNumber = 5;
; WITH QueryResults AS
(
    SELECT TOP (@PageSize * @PageNumber) ROW_NUMBER() OVER (ORDER BY ProductID) AS ROW,
           P.ProductID
    FROM Production.Product P WITH (NOLOCK)
)
SELECT QR.ROW, QR.ProductID, P.Name
FROM QueryResults QR
INNER JOIN Production.Product P WITH (NOLOCK) ON QR.ProductID = P.ProductID
WHERE QR.ROW BETWEEN (((@PageNumber - 1) * @PageSize) + 1) AND (@PageSize * @PageNumber)
ORDER BY QR.ROW ASC
3000 records is not a big deal for SQL Server 2008; you just need to:
avoid * in a select statement;
use proper indexing; you may also try included columns;
try to use indexes on primary as well as foreign key columns;
and you can also try writing the query in different ways, since the same query can often be expressed differently, then compare the query costs with time statistics turned on.
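For the last point, a quick way to compare two variants (a sketch):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- run variant A of the query here, then variant B,
-- and compare the elapsed times and logical reads in the Messages output
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;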