SQL RANK() versus ROW_NUMBER() - sql

I'm confused about the differences between these. Running the following SQL gets me two identical result sets. Can someone please explain the differences?
SELECT ID, [Description], RANK() OVER(PARTITION BY StyleID ORDER BY ID) as 'Rank' FROM SubStyle
SELECT ID, [Description], ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) as 'RowNumber' FROM SubStyle

You will only see the difference if you have ties within a partition for a particular ordering value.
RANK and DENSE_RANK are deterministic in this case: all rows with the same values for both the ordering and partitioning columns get the same result, whereas ROW_NUMBER arbitrarily (non-deterministically) assigns an incrementing number to the tied rows.
Example: (All rows have the same StyleID so are in the same partition and within that partition the first 3 rows are tied when ordered by ID)
WITH T(StyleID, ID)
AS (SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,2)
SELECT *,
RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [RANK],
ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) AS [ROW_NUMBER],
DENSE_RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [DENSE_RANK]
FROM T
Returns
StyleID ID RANK ROW_NUMBER DENSE_RANK
----------- -------- --------- --------------- ----------
1 1 1 1 1
1 1 1 2 1
1 1 1 3 1
1 2 4 4 2
You can see that for the three identical rows the ROW_NUMBER increments, the RANK value remains the same then it leaps to 4. DENSE_RANK also assigns the same rank to all three rows but then the next distinct value is assigned a value of 2.
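The example above can be reproduced outside SQL Server. The following is a minimal sketch that runs the same query through Python's built-in sqlite3 module; it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and the lowercase column aliases are chosen here only to avoid quoting keywords.

```python
import sqlite3

# In-memory database; window functions require SQLite 3.25+ (assumption).
conn = sqlite3.connect(":memory:")

query = """
WITH T(StyleID, ID) AS (
    SELECT 1, 1 UNION ALL
    SELECT 1, 1 UNION ALL
    SELECT 1, 1 UNION ALL
    SELECT 1, 2
)
SELECT StyleID,
       ID,
       RANK()       OVER (PARTITION BY StyleID ORDER BY ID) AS rnk,
       ROW_NUMBER() OVER (PARTITION BY StyleID ORDER BY ID) AS row_num,
       DENSE_RANK() OVER (PARTITION BY StyleID ORDER BY ID) AS dense_rnk
FROM T
ORDER BY row_num
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (StyleID, ID, rnk, row_num, dense_rnk)
```

Because the three tied rows are otherwise identical, the printed result is the same no matter which of them receives row number 1, 2, or 3.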

ROW_NUMBER: Returns a unique number for each row, starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
RANK: Assigns a rank to each row, starting with 1. Rows that have duplicate values receive the same rank, and a gap appears in the sequence after each group of duplicates.

This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
The reason for this is that logically, window functions are calculated before DISTINCT is applied.
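To see the effect of that evaluation order, here is a small sketch using Python's sqlite3 module (assuming SQLite 3.25 or newer). The table t and its eight values mirror the result tables above; the alias rn is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
# Same values as in the result tables above: a, a, a, b, c, c, d, e
conn.executemany("INSERT INTO t (v) VALUES (?)", [(c,) for c in "aaabccde"])

# ROW_NUMBER() makes every row unique before DISTINCT runs, so nothing is removed.
with_row_number = conn.execute(
    "SELECT DISTINCT v, ROW_NUMBER() OVER (ORDER BY v) AS rn FROM t ORDER BY v, rn"
).fetchall()

# DENSE_RANK() gives equal values the same number, so DISTINCT can collapse them.
with_dense_rank = conn.execute(
    "SELECT DISTINCT v, DENSE_RANK() OVER (ORDER BY v) AS rn FROM t ORDER BY v, rn"
).fetchall()

print(len(with_row_number), with_row_number)
print(len(with_dense_rank), with_dense_rank)
```

The first query keeps all eight rows, while the second returns five rows numbered 1 to 5.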
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (window) row_number,
RANK() OVER (window) rank,
DENSE_RANK() OVER (window) dense_rank
FROM t
WINDOW window AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+

Simple query without partition clause:
select
sal,
RANK() over(order by sal desc) as Rank,
DENSE_RANK() over(order by sal desc) as DenseRank,
ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------

Quite a bit:
RANK: the rank of a row is one plus the number of rows ranked ahead of it, so tied rows share a rank and leave a gap after them.
ROW_NUMBER: a distinct sequential number for every row, with no ties and no gaps.
http://www.bidn.com/blogs/marcoadf/bidn-blog/379/ranking-functions-row_number-vs-rank-vs-dense_rank-vs-ntile

Note that all of these window functions return an integer-like value.
Often the database will choose a BIGINT data type, which takes much more space than we need; we will rarely need a range from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
Cast the results to a BYTEINT, SMALLINT, or INTEGER, depending on which of these types your database offers.
Modern systems and hardware are powerful enough that you may never see a meaningful extra use of resources, but I think it's best practice.

Look at this example.
CREATE TABLE [dbo].#TestTable(
[id] [int] NOT NULL,
[create_date] [date] NOT NULL,
[info1] [varchar](50) NOT NULL,
[info2] [varchar](50) NOT NULL,
)
Insert some data:
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/1/09', 'Blue', 'Green')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/2/09', 'Red', 'Yellow')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/3/09', 'Orange', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/1/09', 'Yellow', 'Blue')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/5/09', 'Blue', 'Orange')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/2/09', 'Green', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/8/09', 'Red', 'Blue')
Repeat the same values for id 1, so that there is a tie:
INSERT INTO dbo.#TestTable (id, create_date, info1, info2) VALUES (1,
'1/1/09', 'Blue', 'Green')
Look at all the rows:
SELECT * FROM #TestTable
Now look at the results:
SELECT Id,
create_date,
info1,
info2,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY create_date DESC) AS RowId,
RANK() OVER(PARTITION BY Id ORDER BY create_date DESC) AS [RANK]
FROM #TestTable
Comparing the RowId and RANK columns should make the difference clear.
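The answer does not show the output, so here is a hedged sketch of what to expect, run through Python's sqlite3 module (SQLite 3.25+ assumed) instead of SQL Server. Two assumptions are mine: the dates are read as month/day/year and rewritten in ISO form so they sort correctly as text, and the table is named test_table because SQLite has no #temp syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table (id INTEGER, create_date TEXT, info1 TEXT, info2 TEXT)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [
        (1, "2009-01-01", "Blue", "Green"),
        (1, "2009-01-02", "Red", "Yellow"),
        (1, "2009-01-03", "Orange", "Purple"),
        (2, "2009-01-01", "Yellow", "Blue"),
        (2, "2009-01-05", "Blue", "Orange"),
        (3, "2009-01-02", "Green", "Purple"),
        (3, "2009-01-08", "Red", "Blue"),
        (1, "2009-01-01", "Blue", "Green"),  # the repeated row that creates a tie
    ],
)

rows = conn.execute(
    """
    SELECT id,
           create_date,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY create_date DESC) AS row_id,
           RANK()       OVER (PARTITION BY id ORDER BY create_date DESC) AS rnk
    FROM test_table
    ORDER BY id, row_id
    """
).fetchall()

for row in rows:
    print(row)
```

For id 1 the two rows dated 2009-01-01 get row numbers 3 and 4 but share rank 3; for ids 2 and 3 there are no ties, so both columns agree.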

I haven't done anything with rank, but I discovered this today with row_number().
select item, name, sold, row_number() over(partition by item order by sold) as row from table_name
This will result in some repeating row numbers since in my case each name holds all items. Each item will be ordered by how many were sold.
+--------+------+-----+----+
|glasses |store1| 30 | 1 |
|glasses |store2| 35 | 2 |
|glasses |store3| 40 | 3 |
|shoes |store2| 10 | 1 |
|shoes |store1| 20 | 2 |
|shoes |store3| 22 | 3 |
+--------+------+-----+----+

Also, pay attention to ORDER BY in PARTITION (Standard AdventureWorks db is used for example) when using RANK.
SELECT as1.SalesOrderID,
       as1.SalesOrderDetailID,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) AS rank_same_as_partition,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderDetailID) AS rank_salesorderdetailid
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderID = 43659
ORDER BY SalesOrderDetailID;
Gives result:
SalesOrderID SalesOrderDetailID rank_same_as_partition rank_salesorderdetailid
43659 1 1 1
43659 2 1 2
43659 3 1 3
43659 4 1 4
43659 5 1 5
43659 6 1 6
43659 7 1 7
43659 8 1 8
43659 9 1 9
43659 10 1 10
43659 11 1 11
43659 12 1 12
But if we change the ORDER BY to use OrderQty:
SELECT as1.SalesOrderID,
       as1.OrderQty,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) AS rank_salesorderid,
       RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.OrderQty) AS rank_orderqty
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderID = 43659
ORDER BY OrderQty;
Gives:
SalesOrderID OrderQty rank_salesorderid rank_orderqty
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 2 1 7
43659 2 1 7
43659 3 1 9
43659 3 1 9
43659 4 1 11
43659 6 1 12
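Notice how the rank depends on the ORDER BY. Ordering by SalesOrderID, which is the partition column itself, ties every row at rank 1 (third column in both tables). Ordering by the unique SalesOrderDetailID gives 1 through 12 (rightmost column, first table), and ordering by OrderQty gives shared ranks with gaps (rightmost column, second table).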

Related

Select row A if a condition satisfies else select row B for each group

We have 2 tables, bookings and docs
bookings
booking_id | name
100 | "Val1"
101 | "Val5"
102 | "Val6"
docs
doc_id | booking_id | doc_type_id
6 | 100 | 1
7 | 100 | 2
8 | 101 | 1
9 | 101 | 2
10 | 101 | 2
We need the result like this:
booking_id | doc_id
100 | 7
101 | 10
Essentially, we are trying to get the latest record of doc per booking, but if doc_type_id 2 is present, select the latest record of doc type 2 else select latest record of doc_type_id 1.
Is this possible to achieve with a performance friendly query as we need to apply this in a very huge query?
You can do it with FIRST_VALUE() window function by sorting properly the rows for each booking_id so that the rows with doc_type_id = 2 are returned first:
SELECT DISTINCT booking_id,
FIRST_VALUE(doc_id) OVER (PARTITION BY booking_id ORDER BY doc_type_id = 2 DESC, doc_id DESC) AS doc_id
FROM docs;
If you want full rows returned then you could use ROW_NUMBER() window function:
SELECT booking_id, doc_id, doc_type_id
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY booking_id ORDER BY doc_type_id = 2 DESC, doc_id DESC) rn
FROM docs
) t
WHERE rn = 1;
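The ROW_NUMBER() approach can be checked against the sample data with a short sketch in Python's sqlite3 module (SQLite 3.25+ assumed; SQLite, like MySQL and PostgreSQL, allows a boolean expression in ORDER BY). The two rows for booking 102 are hypothetical additions, included only to show the fallback to doc_type_id 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id INTEGER, booking_id INTEGER, doc_type_id INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (6, 100, 1),
        (7, 100, 2),
        (8, 101, 1),
        (9, 101, 2),
        (10, 101, 2),
        (11, 102, 1),  # hypothetical: booking with only type-1 docs
        (12, 102, 1),  # hypothetical
    ],
)

picked = conn.execute(
    """
    SELECT booking_id, doc_id, doc_type_id
    FROM (
        SELECT docs.*,
               ROW_NUMBER() OVER (
                   PARTITION BY booking_id
                   -- type-2 docs sort first, then the highest doc_id wins
                   ORDER BY (doc_type_id = 2) DESC, doc_id DESC
               ) AS rn
        FROM docs
    ) AS t
    WHERE rn = 1
    ORDER BY booking_id
    """
).fetchall()

print(picked)
```

Bookings 100 and 101 return their latest type-2 doc (7 and 10), and the added booking 102 falls back to its latest type-1 doc (12).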

Count rows in partition with Order By

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?
When you add an order by to an aggregate used as a window function that aggregate turns into a "running count" (or whatever aggregate you use).
The count(*) will return the number of rows up until the "current one" based on the order specified.
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
There are more examples in the Postgres manual
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count. It does the COUNT() for each row of id, for all values up to that id's value.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
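A small sketch can make the tie behaviour concrete. It uses Python's sqlite3 module (SQLite 3.25+ assumed) and a hypothetical table scores with one tie, and compares the default cumulative count, the same count with the frame spelled out, and RANK().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (v INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO scores VALUES (?)", [(10,), (10,), (20,)])

rows = conn.execute(
    """
    SELECT v,
           -- default frame once ORDER BY is present
           COUNT(*) OVER (ORDER BY v) AS count_default,
           -- the same frame written explicitly
           COUNT(*) OVER (ORDER BY v
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS count_range,
           RANK() OVER (ORDER BY v) AS rnk
    FROM scores
    ORDER BY v
    """
).fetchall()

for row in rows:
    print(row)  # (v, count_default, count_range, rnk)
```

With a RANGE frame the tied rows are peers, so both rows with v = 10 report a count of 2, while RANK() gives them 1. For the row with v = 20 the two agree at 3.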
The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
create table search_log
(
id bigint not null primary key,
query varchar(255) not null,
stemmed_query varchar(255) not null,
created timestamp not null
);
SELECT query,
created as seen_on,
first_value(created) OVER query_window as last_seen,
row_number() OVER query_window AS rn,
count(*) OVER query_window AS occurrence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
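As a rough check of the unbounded frame, the query can be run through Python's sqlite3 module (SQLite 3.25+ assumed, which also supports the WINDOW clause). The three log rows are invented for illustration, and the timestamps are stored as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_log ("
    "id INTEGER PRIMARY KEY, query TEXT, stemmed_query TEXT, created TEXT)"
)
# Hypothetical sample rows
conn.executemany(
    "INSERT INTO search_log VALUES (?, ?, ?, ?)",
    [
        (1, "running shoes", "run shoe", "2024-01-01"),
        (2, "run shoes", "run shoe", "2024-01-03"),
        (3, "red hats", "red hat", "2024-01-02"),
    ],
)

rows = conn.execute(
    """
    SELECT query,
           created AS seen_on,
           first_value(created) OVER query_window AS last_seen,
           row_number() OVER query_window AS rn,
           count(*) OVER query_window AS occurrence
    FROM search_log
    WINDOW query_window AS (
        PARTITION BY stemmed_query
        ORDER BY created DESC
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
    ORDER BY stemmed_query, rn
    """
).fetchall()

for row in rows:
    print(row)
```

Both "run shoe" rows report an occurrence of 2 and the same last_seen value, even though the window is ordered; without the explicit frame the count would be a running 1, 2.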

How to account for Postgresql rank() ties

I've a table teams with 30 rows and has a handful of statistics stored as attributes. For example, goals for, goals against, etc and I've created a view that uses rank() and does a good job ranking the records. Here's an abridged query example and resulting table:
SELECT name,
points,
rank() OVER (ORDER BY points DESC) AS point_rank
FROM teams;
name | points | point_rank
-----------------------+-----------+----------------
Team 1 | 14 | 1
Team 2 | 11 | 2
Team 3 | 9 | 3
Team 4 | 9 | 3
I would like to add an additional column that would return boolean based on whether or not the rank is a tie. eg Team 3 and Team 4 in this example. It might look something like this:
name | points | point_rank | tie
-----------------------+-----------+----------------+----------------
Team 1 | 14 | 1 | false
Team 2 | 11 | 2 | false
Team 3 | 9 | 3 | true
Team 4 | 9 | 3 | true
Any ideas here? Or am I approaching this incorrectly and abusing rank() here? Thanks in advance!
You could use a CTE and then use the lag/lead functions to check for ties:
with ranked as (
SELECT name,
points,
rank() OVER (ORDER BY points DESC) AS point_rank
FROM teams
)
select name, points, point_rank,
( point_rank = lag(point_rank, 1, -1::bigint) over (order by point_rank)
or point_rank = lead(point_rank, 1, -1::bigint) over (order by point_rank)
) as is_tie
from ranked;
The default value for the lag and lead function is needed for the first and last row, to avoid checking for null there.
Example: https://dbfiddle.uk/-01aFLr4
One option would be to place your current query into a common table expression and then use it to identify which ranks are duplicate:
WITH cte AS (
SELECT name,
points,
rank() OVER (ORDER BY points DESC) AS point_rank
FROM teams
)
SELECT cte.name,
cte.points,
cte.point_rank,
CASE WHEN t.point_rank IS NOT NULL THEN 'false' ELSE 'true' END AS tie
FROM cte
LEFT JOIN
(
SELECT point_rank
FROM cte
GROUP BY point_rank
HAVING COUNT(*) = 1
) t
ON cte.point_rank = t.point_rank
SELECT name
, points
, rank() OVER (rrr) AS point_rank
-- , count(*) OVER (ppp) AS ppp_cnt
, rank() OVER (pp2) AS sub_rank
, (COUNT(*) OVER (ppp) > 1) AS is_tie
FROM teams
WINDOW ppp AS (PARTITION BY points )
, pp2 AS (PARTITION BY points ORDER BY ctid )
, rrr AS (ORDER BY points DESC)
ORDER BY points DESC
;
Result (I added two extra rows):
name | points | point_rank | sub_rank | is_tie
--------+--------+------------+----------+--------
Team_1 | 14 | 1 | 1 | f
Team_2 | 11 | 2 | 1 | f
Team_3 | 9 | 3 | 1 | t
Team_4 | 9 | 3 | 2 | t
Team_5 | 5 | 5 | 1 | t
Team_6 | 5 | 5 | 2 | t
(6 rows)
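The partition-count idea from the last answer can be tried on the four teams from the question. This is a minimal sketch using Python's sqlite3 module (SQLite 3.25+ assumed); note that SQLite reports booleans as 0 and 1 rather than f and t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [("Team 1", 14), ("Team 2", 11), ("Team 3", 9), ("Team 4", 9)],
)

rows = conn.execute(
    """
    SELECT name,
           points,
           rank() OVER (ORDER BY points DESC) AS point_rank,
           -- a rank is tied when more than one team has the same points
           (COUNT(*) OVER (PARTITION BY points) > 1) AS tie
    FROM teams
    ORDER BY points DESC, name
    """
).fetchall()

for row in rows:
    print(row)
```

Teams 3 and 4 come back with tie = 1, and the other two with tie = 0.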

Renumber dynamic column without update in SQL Server

I have this data
5 | Batman
5 | Superman
5 | Wonderwomen
6 | Green Lantern
6 | Green Arrow
7 | Cyborg
When I run a select query, I want to renumber the rows to
1 | Batman
1 | Superman
1 | Wonderwomen
2 | Green Lantern
2 | Green Arrow
3 | Cyborg
Any thoughts?
EDIT:
Thanks to vittore, I came up with this solution. I'm not sure if my query is good.
I apply ROW_NUMBER() twice. In case my sequence Id jumps, this query should still renumber correctly.
WITH cte AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY id ORDER BY id asc) AS CteId
FROM MyTable
)
SELECT
ROW_NUMBER() OVER(PARTITION BY CteId ORDER BY CteId asc) AS RenumberColumn
FROM cte
RANK function is what you are looking for
select RANK() OVER (ORDER BY id), name
from t
Check out row_number() and dense_rank() as well when you read about it.
UPDATE: If you use rank alone, it will not give you the values you want (1 1 1 2 2 3) but the ranked values (1 1 1 4 4 6).
So in order to get the (1 2 3) numbering, group by id, rank the groups, and join back:
select a.r, t.name from t
inner join (select id, rank() over (order by id asc) r
from t group by id) a
on t.id = a.id
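Since dense_rank() is mentioned above, it is worth noting that it appears to give the wanted numbering directly, without the group-and-join step. The sketch below runs it through Python's sqlite3 module (SQLite 3.25+ assumed) on the data from the question; the table name t and the column names id and name follow the answer, and the alias renumbered is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (5, "Batman"),
        (5, "Superman"),
        (5, "Wonderwomen"),
        (6, "Green Lantern"),
        (6, "Green Arrow"),
        (7, "Cyborg"),
    ],
)

rows = conn.execute(
    """
    SELECT DENSE_RANK() OVER (ORDER BY id) AS renumbered, name
    FROM t
    ORDER BY id, name
    """
).fetchall()

for row in rows:
    print(row)
```

Equal ids share a number and the next distinct id gets the next integer, so ids 5, 6, 7 become 1, 2, 3 regardless of gaps in the original sequence.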
If it's always -4, then:
Select (number-4), name
from table
But I doubt it's that simple.

Selecting row with highest ID based on another column

In SQL Server 2008 R2, suppose I have a table layout like this...
+----------+---------+-------------+
| UniqueID | GroupID | Title |
+----------+---------+-------------+
| 1 | 1 | TEST 1 |
| 2 | 1 | TEST 2 |
| 3 | 3 | TEST 3 |
| 4 | 3 | TEST 4 |
| 5 | 5 | TEST 5 |
| 6 | 6 | TEST 6 |
| 7 | 6 | TEST 7 |
| 8 | 6 | TEST 8 |
+----------+---------+-------------+
Is it possible to select every row with the highest UniqueID number, for each GroupID. So according to the table above - if I ran the query, I would expect this...
+----------+---------+-------------+
| UniqueID | GroupID | Title |
+----------+---------+-------------+
| 2 | 1 | TEST 2 |
| 4 | 3 | TEST 4 |
| 5 | 5 | TEST 5 |
| 8 | 6 | TEST 8 |
+----------+---------+-------------+
Been chomping on this for a while, but can't seem to crack it.
Many thanks,
SELECT *
FROM (SELECT uniqueid, groupid, title,
Row_number()
OVER ( partition BY groupid ORDER BY uniqueid DESC) AS rn
FROM table) a
WHERE a.rn = 1
With SQL-Server as rdbms you can use a ranking function like ROW_NUMBER:
WITH CTE AS
(
SELECT UniqueID, GroupID, Title,
RN = ROW_NUMBER() OVER (PARTITION BY GroupID
ORDER BY UniqueID DESC)
FROM dbo.TableName
)
SELECT UniqueID, GroupID, Title
FROM CTE
WHERE RN = 1
This returns exactly one record for each GroupID, even if there are multiple rows with the highest UniqueID (which the column name suggests cannot happen). If you want to return all of the tied rows in that case, use DENSE_RANK instead of ROW_NUMBER.
Here you can see all functions and how they work: http://technet.microsoft.com/en-us/library/ms189798.aspx
Since you have not mentioned any RDBMS, this statement below will work on almost all RDBMS. The purpose of the subquery is to get the greatest uniqueID for every GROUPID. To be able to get the other columns, the result of the subquery is joined on the original table.
SELECT a.*
FROM tableName a
INNER JOIN
(
SELECT GroupID, MAX(uniqueID) uniqueID
FROM tableName
GROUP By GroupID
) b ON a.GroupID = b.GroupID
AND a.uniqueID = b.uniqueID
In the case that your RDBMS supports Analytic functions, you can use ROW_NUMBER()
SELECT uniqueid, groupid, title
FROM
(
SELECT uniqueid, groupid, title,
ROW_NUMBER() OVER (PARTITION BY groupid
ORDER BY uniqueid DESC) rn
FROM tableName
) x
WHERE x.rn = 1
TSQL Ranking Functions
The ROW_NUMBER() generates sequential number which you can filter out. In this case the sequential number is generated on groupid and sorted by uniqueid in descending order. The greatest uniqueid will have a value of 1 in rn.
SELECT *
FROM the_table tt
WHERE NOT EXISTS (
SELECT *
FROM the_table nx
WHERE nx.GroupID = tt.GroupID
AND nx.UniqueID > tt.UniqueID
)
;
This should work in any DBMS (no window functions or CTEs are needed) and is probably faster than a subquery with an aggregate.
Keeping it simple:
select * from test2
where UniqueID in (select max(UniqueID) from test2 group by GroupID)
Considering:
create table test2
(
UniqueID numeric,
GroupID numeric,
Title varchar(100)
)
insert into test2 values(1,1,'TEST 1')
insert into test2 values(2,1,'TEST 2')
insert into test2 values(3,3,'TEST 3')
insert into test2 values(4,3,'TEST 4')
insert into test2 values(5,5,'TEST 5')
insert into test2 values(6,6,'TEST 6')
insert into test2 values(7,6,'TEST 7')
insert into test2 values(8,6,'TEST 8')
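To compare the approaches from the answers above on this test data, here is a sketch in Python's sqlite3 module (SQLite 3.25+ assumed for ROW_NUMBER). It declares the columns as INTEGER rather than numeric, which is my simplification, and runs the ROW_NUMBER, NOT EXISTS, and IN (MAX) variants side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test2 (UniqueID INTEGER, GroupID INTEGER, Title VARCHAR(100))"
)
conn.executemany(
    "INSERT INTO test2 VALUES (?, ?, ?)",
    [
        (1, 1, "TEST 1"), (2, 1, "TEST 2"), (3, 3, "TEST 3"), (4, 3, "TEST 4"),
        (5, 5, "TEST 5"), (6, 6, "TEST 6"), (7, 6, "TEST 7"), (8, 6, "TEST 8"),
    ],
)

# Variant 1: number rows per group, newest first, and keep number 1.
by_row_number = conn.execute(
    """
    SELECT UniqueID, GroupID, Title
    FROM (
        SELECT UniqueID, GroupID, Title,
               ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY UniqueID DESC) AS rn
        FROM test2
    ) AS x
    WHERE rn = 1
    ORDER BY UniqueID
    """
).fetchall()

# Variant 2: keep rows for which no row with a higher UniqueID exists in the group.
by_not_exists = conn.execute(
    """
    SELECT tt.UniqueID, tt.GroupID, tt.Title
    FROM test2 tt
    WHERE NOT EXISTS (
        SELECT 1 FROM test2 nx
        WHERE nx.GroupID = tt.GroupID AND nx.UniqueID > tt.UniqueID
    )
    ORDER BY tt.UniqueID
    """
).fetchall()

# Variant 3: keep rows whose UniqueID is a per-group maximum.
by_in_max = conn.execute(
    """
    SELECT UniqueID, GroupID, Title
    FROM test2
    WHERE UniqueID IN (SELECT MAX(UniqueID) FROM test2 GROUP BY GroupID)
    ORDER BY UniqueID
    """
).fetchall()

print(by_row_number)
```

All three variants return the same four rows here. Variant 3 relies on UniqueID being unique across the whole table; if the same value could appear in two groups, it could match a row that is not its own group's maximum.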