ROW_NUMBER() execution plan - sql

Please consider this query:
SELECT num, *
FROM (
    SELECT OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate,
           ShippedDate,
           ROW_NUMBER() OVER (ORDER BY OrderID) AS num
    FROM Orders
) AS numbered
WHERE num BETWEEN 0 AND 100
When I execute this query and look at the execution plan, it looks like this:
I want to know:
1) What steps does SQL Server 2008 take to compute ROW_NUMBER() in a query?
2) Why is the first step in the execution plan a Clustered Index Scan?
3) Why is the filter's cost only 2%? I mean, why doesn't SQL Server perform a table scan to find the matching rows? Does ROW_NUMBER() cause an index to be created?

The Segment/Sequence Project portions of the plan relate to the use of ROW_NUMBER().
You have a clustered index scan because there is no WHERE clause on your inner SELECT, hence all rows of the table have to be returned.
The Filter relates to the WHERE clause on the outer SELECT.

That "Compute Scalar" part of the query is the row_number being created.
Because you're selecting every row from Orders, then numbering it, then selecting 1-100. That's a table (or in this case a clustered index) scan anyway you slice it.
No, indexes aren't created on the fly. It's gotta check the rows because the set doesn't come back ordered in your subquery.
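(A sketch to make the point concrete, not from the answers above: since num only exists after every row has been read and numbered, a filter on it can never seek. If all you ever need is the first page, a plain TOP over the clustered key returns the same rows without numbering the whole table.)
-- Equivalent to num BETWEEN 1 AND 100 in the original query.
SELECT TOP (100) OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate
FROM Orders
ORDER BY OrderID;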

Related

SQL tuning, long running query + rownum

I have a million records in a database table with account number, address, and many more columns. I want 100 rows sorted in descending order. I used ROWNUM for this, but the query takes a long time to execute, since it scans the full table first, sorts it, and only then applies ROWNUM.
What is the solution to minimize the query execution time?
For example:
select *
from (
    select acc_no, address
    from customer
    order by acc_no desc
)
where ROWNUM <= 100;
From past experience I have found that TOP works best for this scenario.
Also, you should select only the columns you need and avoid using the wildcard (*):
SELECT TOP 100 [acc_no], [address] FROM [customer] ORDER BY [acc_no] DESC
Useful resources about TOP, LIMIT and even ROWNUM.
https://www.w3schools.com/sql/sql_top.asp
Make sure you use an index on the acc_no column.
If you have an index already present on acc_no, check if that's being used during query execution or not by verifying the query execution plan.
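For reference, a minimal way to view the plan from SQL*Plus (a sketch using the standard DBMS_XPLAN package; the query is the one from the question):
EXPLAIN PLAN FOR
select * from (
    select acc_no, address from customer order by acc_no desc
) where ROWNUM <= 100;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);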
To create a new index if one is not present, use the query below:
Create index idx1 on customer(acc_no); -- If acc_no is not unique
Create unique index idx1 on customer(acc_no); -- If acc_no is unique. Note: Unique index is faster.
If the explain plan output shows "Full table scan", the optimizer is not using the index.
Try with a hint first:
select *
from (
    select /*+ index(customer idx1) */ acc_no, address
    from customer
    order by acc_no desc
)
where ROWNUM <= 100;
If the hinted query returns results quickly, then you need to check why the optimizer is deliberately ignoring your index. One probable reason for this is outdated statistics. Refresh the statistics.
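For example (a sketch using Oracle's standard DBMS_STATS package; the owner is assumed to be the current schema):
BEGIN
    -- Gather fresh table statistics; cascade => TRUE refreshes index statistics too.
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CUSTOMER', cascade => TRUE);
END;
/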
Hope this helps.
Consider getting your top account numbers in an inner query / in-line view so that you only perform the joins on those 100 customer records. Otherwise, you could be performing all the joins on the million+ rows and then sorting the million+ results to get the top 100. Something like this may work:
select .....
from customer
where customer.acc_no in (
    select acc_no
    from (
        select inner_cust.acc_no
        from customer inner_cust
        order by inner_cust.acc_no desc
    )
    where rownum <= 100
)
and ...
Or, if you are using 12c, you can use FETCH FIRST 100 ROWS ONLY:
select .....
from customer
where customer.acc_no in (
    select inner_cust.acc_no
    from customer inner_cust
    order by inner_cust.acc_no desc
    fetch first 100 rows only
)
and ...
This will give the result within ~100 ms, but MAKE SURE that there is an index on column ACC_NO. There can also be a combined index on ACC_NO plus other columns, but ACC_NO MUST be in the first position in the index. You have to see "range scan" in the execution plan, not "full table scan" and not "skip scan". You will probably see nested loops in the execution plan (these fetch the ADDRESSes from the table). You can improve speed even more by creating a combined index on ACC_NO, ADDRESS (in this order). In that case the Oracle engine does not have to read the table at all, because all the information is contained in the index. You can compare the two in the execution plan.
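In DDL form that covering index would look something like this (a sketch; the index name is made up):
-- acc_no first for the range scan, address second so the query is answered
-- entirely from the index and the table is never read.
create index idx_customer_acc_addr on customer (acc_no, address);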
select top 100 acc_no, address
from customer
order by acc_no desc

Optimized db2 query with total count and pagination

I have a complex query, which I am simplifying here for clarity.
Query: the real query has GROUP BY, ORDER BY, a WHERE clause, and multiple joins with other tables.
SELECT FIRSTNAME, LASTNAME FROM CUSTOMERS;
Requirement: NEED AN OPTIMIZED APPROACH to get the total record count of a BIG query along with pagination.
My approach 1: execute two queries, finding the count first and then the paginated rows.
SELECT COUNT(1) FROM CUSTOMERS;
SELECT FIRSTNAME, LASTNAME, ROWNUMBER FROM (
    SELECT FIRSTNAME, LASTNAME, ROW_NUMBER() OVER (ORDER BY CUSTOMERCID) AS ROWNUMBER FROM CUSTOMERS
) AS CUST WHERE ROWNUMBER BETWEEN 10 AND 20;
My approach 2: find the total count as well as the required paginated rows in a single query.
SELECT FIRSTNAME, LASTNAME, ROWNUMBER, COUNTROWS FROM (
    SELECT FIRSTNAME, LASTNAME, ROW_NUMBER() OVER (ORDER BY CUSTOMERCID) AS ROWNUMBER
    FROM CUSTOMERS
) AS CUST, (SELECT COUNT(1) AS COUNTROWS FROM CUSTOMERS) AS COUNTNUM
WHERE ROWNUMBER BETWEEN 10 AND 20;
My approach 3: create a VIEW of the second approach.
Please suggest which approach I should opt for. As per my research, the 3rd approach should be the most optimized, since DATABASE VIEWS are more optimized.
There's nothing about a view that automatically makes it "more optimized" than the query contained within it. The query optimizer decomposes the original SQL and often rewrites it into a much different-looking statement before execution.
After performing RUNSTATS to ensure your tables and indexes have accurate statistics, DB2's built-in EXPLAIN tools such as the db2expln utility, the Design Advisor (db2advis), and the Visual Explain tool in IBM Data Studio offer the best chance at understanding exactly why a particular query option is better or worse than another.
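For example, statistics can be refreshed from the DB2 command line like this (a sketch; the schema name is hypothetical):
-- Refresh table, distribution, and index statistics so the optimizer has accurate input.
RUNSTATS ON TABLE MYSCHEMA.CUSTOMERS WITH DISTRIBUTION AND DETAILED INDEXES ALL;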
The best performance for pagination comes when the fewest columns do the pagination work, and the results are then joined back on the key columns to pick up the remaining data. Two columns control pagination here: CUSTOMERCID and ROWNUMBER. CUSTOMERCID is (I'm assuming) unique, so as the primary key it is already indexed, and it is also the column in the ROW_NUMBER() ORDER BY, so this is the most efficient pagination.
create view dancustomercid as
SELECT CUSTOMERCID, ROW_NUMBER() OVER (ORDER BY CUSTOMERCID) AS ROWNUMBER FROM CUSTOMERS
Then join on the output of the view. Notice there is no ORDER BY to slow things down, just a join on the key field CUSTOMERCID:
SELECT FIRSTNAME, LASTNAME, ROWNUMBER
FROM dancustomercid a
JOIN CUSTOMERS AS b ON a.CUSTOMERCID = b.CUSTOMERCID
WHERE a.ROWNUMBER BETWEEN 11 AND 20;
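If you also need the total count from approach 2, a window COUNT can produce it in the same pass (my sketch, not part of the answer above; DB2 supports COUNT(*) OVER()):
SELECT FIRSTNAME, LASTNAME, ROWNUMBER, COUNTROWS
FROM (
    SELECT FIRSTNAME, LASTNAME,
           ROW_NUMBER() OVER (ORDER BY CUSTOMERCID) AS ROWNUMBER,
           COUNT(*) OVER () AS COUNTROWS -- total row count, no second query
    FROM CUSTOMERS
) AS CUST
WHERE ROWNUMBER BETWEEN 10 AND 20;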

best way to set up an index for this particular query?

I have a DB table with about 2 million rows, and I need to fix my current paging. I have decided to go with the following:
SET @startRowIndex = ((@Page - 1) * @PageSize) + 1;
SET ROWCOUNT @startRowIndex
SELECT @first_id = ProductID FROM LiveProducts (nolock) WHERE ManufacturerID = @ManufacturerID AND ModifiedOn >= @tStamp ORDER BY ProductID
SET ROWCOUNT @PageSize
SELECT * FROM LiveProducts (nolock) WHERE ManufacturerID = @ManufacturerID AND ProductID >= @first_id ORDER BY ProductID
I am nowhere near a DBA and I want this to be as fast as possible. What index(es) should I set on this thing? From my reading and my basic understanding, I gathered that I should create a non-clustered index on ManufacturerID, ProductID, and ModifiedOn.
But should they all be index key columns, or just one of them, with the others as included columns?
The first query uses the following columns: ProductID, ManufacturerID, and ModifiedOn.
Because you have an inequality on the date, the index can be used to optimize the WHERE clause but not the ORDER BY. However, by including ProductID in the index, the engine can satisfy the entire query using the following index: LiveProducts(ManufacturerID, ModifiedOn, ProductID). Note that the ordering of these columns is important. And the query will still need to do a sort for the ORDER BY.
The second query selects all columns, so it needs to go to the original data. So the optimization is on the WHERE clause only. For this, use LiveProducts(ManufacturerID, ProductID). In this case, it should be able to use the index for the sort.
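As DDL, the two suggestions would look something like this (a sketch; the index names are made up):
-- First query: equality on ManufacturerID, range on ModifiedOn,
-- ProductID included as a key column so the index covers the query.
CREATE NONCLUSTERED INDEX IX_LiveProducts_Mfr_Modified
    ON LiveProducts (ManufacturerID, ModifiedOn, ProductID);

-- Second query: equality on ManufacturerID, range and sort on ProductID.
CREATE NONCLUSTERED INDEX IX_LiveProducts_Mfr_Product
    ON LiveProducts (ManufacturerID, ProductID);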

index is not working while using ROW_NUMBER in sql server

I have used ROW_NUMBER() to implement paging in my stored procedure. Paging is working fine. The problem is that after implementing ROW_NUMBER(), the indexes are not used and a clustered index SCAN happens, even though I use the primary key column in the ORDER BY section.
Below is the sample query:
SELECT TOP (@insPageSize) A.RowNum, A.AdID, A.AdTitle, A.AdFor, A.AdCondition,
       A.AdExpPrice, A.CreatedDate, A.ModifiedDate, A.AdUID
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY vaa.AdID DESC) AS RowNum,
           vaa.AdID, vaa.AdTitle, vaa.CityID, vaa.AdFor, vaa.AdCondition,
           vaa.AdExpPrice, vaa.CreatedDate, vaa.ModifiedDate, vaa.AdUID
    FROM Catalogue.vwAvailableActiveAds vaa
    WHERE vaa.CategoryID = @intCategoryID AND vaa.CountryCode = @chrCountryCode
          AND vaa.CreatedDate > DATEADD(dd, -90, GETUTCDATE())
          AND vaa.StateID = @inbStateID AND vaa.CityID = @inbCityID
) A
WHERE A.RowNum > (@insPageSize * (@insPageNo - 1))
If I try to execute only the inner query:
SELECT ROW_NUMBER() OVER (ORDER BY vaa.AdID DESC) AS RowNum,
       vaa.AdID, vaa.AdTitle, vaa.CityID, vaa.AdFor, vaa.AdCondition,
       vaa.AdExpPrice, vaa.CreatedDate, vaa.ModifiedDate, vaa.AdUID
FROM Catalogue.vwAvailableActiveAds vaa
WHERE vaa.CategoryID = @intCategoryID AND vaa.CountryCode = @chrCountryCode
      AND vaa.CreatedDate > DATEADD(dd, -90, GETUTCDATE())
      AND vaa.StateID = @inbStateID AND vaa.CityID = @inbCityID
It does not use any index. AdID is the primary key, and there is another non-clustered index which covers the whole WHERE clause, but an index scan still occurs. If I remove ROW_NUMBER() from the inner query and check its execution plan, the indexes work fine, but StateID and CityID still show up as "predicate" even though they are in the non-clustered index.
Please give me some guidance to solve both of my problems.
What do you expect, a seek? You are doing several things here that make it very difficult to perform a seek: (a) returning a RANGE of rows; (b) sorting to get ROW_NUMBER() by AdID DESC - probably not the order in which your PK is defined; (c) filtering against something other than the PK; and (d) including many columns in the output that are unlikely to be covered by any NC index. A lot of people throw their hands in the air, yelling, "Oh my gosh! It's a scan! This is terrible!" - even in cases where, in fact, that's the most efficient way to do it.
(Just because a seek doesn't happen doesn't mean "indexes don't work" - it just means they would probably be even less efficient in satisfying this query.)
row_number() is not using the index because the data has to be filtered first, and this would seem to require a full table scan - or, at least, complicated combinations of indexes.
If you build an index on vwAvailableActiveAds(CategoryID, CountryCode, StateID, CityID, CreatedDate), then it should be used for the WHERE clause. The application of row_number() would still not use the index, but it would presumably operate on a much smaller set of data.
By the way, this assumes that the view really is just a select on an underlying table. If the query in the view is more complicated (even with a where clause, join, or group by), then that particular index might not be the best approach.
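As DDL, something like this (a sketch; the index must be created on the view's underlying table, assumed here to be Catalogue.Ads, and the name is made up):
-- Equality columns first, the range column (CreatedDate) last.
CREATE NONCLUSTERED INDEX IX_Ads_Category_Location
    ON Catalogue.Ads (CategoryID, CountryCode, StateID, CityID, CreatedDate);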

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
    SELECT COUNT(1) AS cnt
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
    SELECT 1 AS marker
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server could now, in theory, stop counting after n + 1 rows per group, it produces the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The two greatest costs in your query are the re-ordering for the GROUP BY (due to the lack of an appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH sequenced_data AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY fieldC) AS sequence_id
    FROM yourTable
)
SELECT COUNT(*)
FROM sequenced_data
WHERE sequence_id = (n + 1)
Assumes SQL Server 2005+.
Without indexing, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase, whether "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
    SELECT TOP (1) 1 AS marker
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but the worst case will take as long as the original approach). I have to say, though, that this is a somewhat imperative way of thinking - not sure if it's correct...
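The same existence check can also be phrased with EXISTS (my sketch, not the answerer's code), which makes the "stop at the first qualifying group" intent explicit:
-- [table], c, and n are the placeholders from the question.
IF EXISTS (
    SELECT 1
    FROM [table]
    GROUP BY c
    HAVING COUNT(1) > n
)
    SELECT 1 AS has_duplicates; -- at least one group has more than n rows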
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt that it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
    SELECT a.aFK,
           ROW_NUMBER() OVER (PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
    FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.