Index is not working while using ROW_NUMBER in SQL Server

I have used ROW_NUMBER() to implement paging in my stored procedure. Paging works fine, but after introducing ROW_NUMBER() the indexes are no longer used and a clustered index SCAN happens, even though I use the primary key column in the ORDER BY.
Below is the sample query:
SELECT TOP (@insPageSize) A.RowNum, A.AdID, A.AdTitle, A.AdFor, A.AdCondition,
       A.AdExpPrice, A.CreatedDate, A.ModifiedDate, A.AdUID
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY vaa.AdID DESC) AS RowNum,
           vaa.AdID, vaa.AdTitle, vaa.CityID, vaa.AdFor, vaa.AdCondition,
           vaa.AdExpPrice, vaa.CreatedDate, vaa.ModifiedDate, vaa.AdUID
    FROM Catalogue.vwAvailableActiveAds vaa
    WHERE vaa.CategoryID = @intCategoryID AND vaa.CountryCode = @chrCountryCode
          AND vaa.CreatedDate > DATEADD(dd, -90, GETUTCDATE())
          AND vaa.StateID = @inbStateID AND vaa.CityID = @inbCityID
) A
WHERE A.RowNum > (@insPageSize * (@insPageNo - 1))
If I try to execute only the inner query:
SELECT ROW_NUMBER() OVER (ORDER BY vaa.AdID DESC) AS RowNum,
       vaa.AdID, vaa.AdTitle, vaa.CityID, vaa.AdFor, vaa.AdCondition,
       vaa.AdExpPrice, vaa.CreatedDate, vaa.ModifiedDate, vaa.AdUID
FROM Catalogue.vwAvailableActiveAds vaa
WHERE vaa.CategoryID = @intCategoryID AND vaa.CountryCode = @chrCountryCode
      AND vaa.CreatedDate > DATEADD(dd, -90, GETUTCDATE())
      AND vaa.StateID = @inbStateID AND vaa.CityID = @inbCityID
It does not use any index. AdID is the primary key, and there is another non-clustered index which covers the whole WHERE clause, but an index scan still occurs. If I remove ROW_NUMBER() from the inner query and check its execution plan, the indexes work fine, but StateID and CityID still show up as a "predicate" (rather than a seek predicate) even though they are in the non-clustered index.
Please give me some guidance to solve both of my problems.

What do you expect, a seek? You are doing several things here that make it very difficult to perform a seek: (a) returning a RANGE of rows; (b) sorting to get ROW_NUMBER() by AdID DESC, which is probably not the order in which your PK is defined; (c) filtering against something other than the PK; and (d) including many columns in the output that are unlikely to be covered by any NC index. A lot of people throw their hands in the air, yelling, "Oh my gosh! It's a scan! This is terrible!" Even in cases where, in fact, that's the most efficient way to do it.
(Just because a seek doesn't happen doesn't mean "indexes don't work" - it just means they probably would be even less efficient in satisfying this query.)
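If a seek really mattered here, a covering non-clustered index along these lines is the kind of thing that would make one plausible. This is only a sketch - it assumes the view reads a single base table (called Catalogue.Ads here purely for illustration), and whether it actually beats the scan depends on your data:
CREATE NONCLUSTERED INDEX IX_Ads_Covering          -- hypothetical name and table
ON Catalogue.Ads (CategoryID, CountryCode, StateID, CityID, CreatedDate)
INCLUDE (AdTitle, AdFor, AdCondition, AdExpPrice, ModifiedDate, AdUID);
-- AdID rides along for free, since the clustering key is part of every NC index.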

ROW_NUMBER() is not using the index because the data has to be filtered first, and this would seem to require a full table scan, or at least a complicated combination of indexes.
If you build an index on vwAvailableActiveAds(CategoryID, CountryCode, StateID, CityID, CreatedDate), then it should be used for the WHERE clause. The application of ROW_NUMBER() would still not use the index, but it would presumably run on a much smaller set of data.
By the way, this assumes that the view is really just a select on an underlying table. If the query in the view is more complicated (even with a where clause, join, or group by), then that particular index might not be the best approach.
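As a sketch, the suggested index would look something like this - note that it has to be created on the underlying table (called Catalogue.Ads here for illustration), not on the view itself, unless you go the indexed-view route:
CREATE NONCLUSTERED INDEX IX_Ads_WhereClause       -- illustrative name
ON Catalogue.Ads (CategoryID, CountryCode, StateID, CityID, CreatedDate);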

Related

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL Server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
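The reliable emulation, then, is to expose whatever expression defines the derived table's order as a real column, and reference it both inside OVER() and in the outermost ORDER BY. A minimal sketch (the values and the sort_key column are illustrative, not from the question's external logic):
SELECT t.v,
       ROW_NUMBER() OVER (ORDER BY t.sort_key DESC) AS rn
FROM (
    SELECT v, v AS sort_key        -- expose the ordering expression explicitly
    FROM (VALUES (1), (2), (3), (4)) AS src(v)
) AS t
ORDER BY t.sort_key DESC;          -- only this outermost ORDER BY guarantees the order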

Best way to set up an index for this particular query?

I have a DB with about 2 million rows, and I need to fix my current paging. I have decided to go with the following:
SET @startRowIndex = ((@Page - 1) * @PageSize) + 1;
SET ROWCOUNT @startRowIndex
SELECT @first_id = ProductID FROM LiveProducts WITH (NOLOCK) WHERE ManufacturerID = @ManufacturerID AND ModifiedOn >= @tStamp ORDER BY ProductID
SET ROWCOUNT @PageSize
SELECT * FROM LiveProducts WITH (NOLOCK) WHERE ManufacturerID = @ManufacturerID AND ProductID >= @first_id ORDER BY ProductID
I am nowhere near a DBA and I want this to be as fast as possible. What index(es) should I set up on this thing? From my reading and my basic understanding, I gathered I should create a non-clustered index on ManufacturerID, ProductID, and ModifiedOn.
But should they all be index key columns, or just one of them, with the others as included columns?
The first query uses the following columns: ProductId, ManufacturerId, and ModifiedOn.
Because you have an inequality on the date, the index can be used to optimize the where clause but not the order by. However, by including the ProductId in the index, the engine can satisfy the entire query using the following index: LiveProducts(ManufacturerId, ModifiedOn, ProductId). Note that the ordering of these columns is important. And, the query will still need to do a sort for the order by.
The second query is selecting all columns, so it needs to go back to the original data. So, the optimization is on the WHERE clause only. For this, use LiveProducts(ManufacturerID, ProductID). In this case, it should be able to use the index for the sort as well.
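Expressed as DDL, the two suggestions would look roughly like this (index names are illustrative):
-- First query: equality on ManufacturerID, range on ModifiedOn, and ProductID
-- added so the index covers everything the query touches.
CREATE NONCLUSTERED INDEX IX_LiveProducts_Mfr_Modified_Product
ON LiveProducts (ManufacturerID, ModifiedOn, ProductID);

-- Second query: supports the WHERE clause and the ORDER BY ProductID.
CREATE NONCLUSTERED INDEX IX_LiveProducts_Mfr_Product
ON LiveProducts (ManufacturerID, ProductID);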

ROW_NUMBER() execution plan

Please consider this query:
SELECT num, *
FROM (
    SELECT OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate,
           ShippedDate,
           ROW_NUMBER() OVER (ORDER BY OrderID) AS num
    FROM Orders
) AS numbered
WHERE num BETWEEN 0 AND 100
When I execute this query and look at the execution plan, it's like this:
I want to know:
1) What steps does SQL Server 2008 take to evaluate ROW_NUMBER() in a query?
2) Why is the first step in the execution plan a Clustered Index Scan?
3) Why is the filtering cost only 2%? I mean, why doesn't SQL Server perform a table scan to get the appropriate data? Does ROW_NUMBER() cause an index to be created?
The Segment/Sequence Project portions of the plan relate to the use of ROW_NUMBER().
You have a clustered index scan because there is no WHERE clause on your inner SELECT, hence all rows of the table have to be returned.
The Filter relates to the WHERE clause on the outer SELECT.
That "Compute Scalar" part of the query is the row_number being created.
Because you're selecting every row from Orders, then numbering it, then selecting 1-100. That's a table scan (or in this case a clustered index scan) any way you slice it.
No, indexes aren't created on the fly. It has to check all the rows, because the set doesn't come back ordered in your subquery.
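For contrast, here is a sketch of the same query with a filter pushed into the derived table (the OrderID range is hypothetical). Because the predicate can be applied before the numbering, the optimizer can seek on the clustered index instead of scanning it - though note the row numbers are then computed only over the filtered rows:
SELECT num, numbered.*
FROM (
    SELECT OrderID, CustomerID, EmployeeID, OrderDate,
           ROW_NUMBER() OVER (ORDER BY OrderID) AS num
    FROM Orders
    WHERE OrderID BETWEEN 10248 AND 10347   -- hypothetical key range
) AS numbered;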

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
    SELECT COUNT(1) AS cnt
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
    SELECT 1 AS x
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server could now, in theory, stop counting after n + 1 rows per group, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH sequenced_data AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY fieldC
                              ORDER BY (SELECT NULL)) AS sequence_id
           -- ROW_NUMBER() requires an ORDER BY; (SELECT NULL) supplies one
           -- without requesting any particular order
    FROM yourTable
)
SELECT COUNT(*)
FROM sequenced_data
WHERE sequence_id = (n + 1)
This assumes SQL Server 2005+.
Without an index, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase, whether "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
    SELECT TOP(1) 1 AS x
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (though the worst case will take as long as the original approach). I have to say, though, that this is a somewhat imperative way of thinking - not sure if it's correct...
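The same early-out intent can be made explicit with EXISTS, which is semantically equivalent to the TOP(1) version. A sketch, keeping the question's placeholders c and [table], with the threshold n as a variable:
DECLARE @n int = 1;                -- the duplicate threshold from the question

IF EXISTS (
    SELECT 1
    FROM [table]
    GROUP BY c
    HAVING COUNT(*) > @n
)
    PRINT 'at least one group exceeds the threshold';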
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
    SELECT a.aFK,
           ROW_NUMBER() OVER (PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
    FROM Address a
)
SELECT COUNT(DuplicateCount)
FROM duplicateRows
WHERE DuplicateCount > 1    -- only rows beyond the first in each group
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.

Remove duplicate rows - Impossible to find a decisive answer

You'd think I came straight here to ask my question, but I googled an awful lot without finding a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key, and thus unique.
I have to remove all rows where columns 2 through 11 are duplicated. It's a basic question really, but there are so many different approaches, while everyone is ultimately seeking the same solution: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
L
As a generic answer:
WITH cte AS (
SELECT ROW_NUMBER() OVER (
PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
FROM Table)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than the GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates, you're still left with the 'trivial' task of choosing a 'keeper' amongst the duplicates.
ROW_NUMBER OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep first 3 of the duplicates', not only 'keep just 1', which is a behavior really hard to do with GROUP BY ... HAVING.
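For example, keeping the first three rows of each duplicate group is just a different threshold against the same CTE:
DELETE FROM cte
WHERE rn > 3;    -- keep the first 3 per group instead of 1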
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches. Delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
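A sketch of the batched variant, with the same placeholders as the generic answer above and an arbitrary batch size:
DECLARE @batch int = 10000;
WHILE 1 = 1
BEGIN
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
        FROM [Table]
    )
    DELETE TOP (@batch) FROM cte
    WHERE rn > 1;

    IF @@ROWCOUNT < @batch BREAK;  -- the last batch came up short: done
END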
And the final question is whether this will perform acceptably. It depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, then it won't perform well. An appropriate index will help. But I can't say anything more without knowing the exact schema involved (structure of the clustered index/heap, all non-clustered indexes, etc.).
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
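A sketch of that approach, with placeholder names (ID as the PK; Col2, Col3, ... standing in for the columns that must be unique):
-- 1) One keeper key per group: the MIN primary key among each set of duplicates.
SELECT MIN(ID) AS KeepID
INTO #keepers
FROM dbo.MyTable
GROUP BY Col2, Col3;               -- ... list all columns that define a duplicate

-- 2) Copy only the keeper rows into the new table.
SELECT t.*
INTO dbo.MyTable_Deduped
FROM dbo.MyTable AS t
JOIN #keepers AS k ON k.KeepID = t.ID;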
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide many details about your tables, so let me just give you a sample:
;WITH Duplicates AS
(
    SELECT
        OrderID,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
    FROM
        dbo.Orders
)
DELETE FROM Duplicates   -- delete through the CTE, not the base table directly
WHERE RowN > 1
The CTE ( WITH ... AS (...) ) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH..... expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with creating a new table, then just use DELETE TOP(1). Use a subquery to get all the IDs of rows that are duplicates, and then use DELETE TOP to delete where there are multiple rows. You might have to run it more than once if there is more than one duplicate per group, but you get the point.
DELETE TOP(1) FROM [Table]
WHERE ID IN (SELECT MAX(ID)            -- an aggregate makes the subquery valid SQL
             FROM [Table]
             GROUP BY Field
             HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.