Creating row_number that starts over for a subset of data - sql

I'm pretty stumped with my latest issue. Here's how my data looks
ID Item Price Rank
1 A 1.00 1
2 A 2.00 2
3 A 2.00 2
4 A 4.00 3
5 B 1.23 1
6 B 2.24 2
7 B 4.40 3
The problem is that when there is a 'tie' (see the rows with id = 2 and 3), I want it to be:
ID Item Price Rank
1 A 1.00 1
2 A 2.00 2
3 A 2.00 3
4 A 4.00 4
I know I could do it really easily with a cursor, but I think performance-wise that would be terrible. I tried using window functions like LAG and Row_Number but you're still dealing with row by row decisions. And I couldn't come up with a way to iterate through and then start over when you get to the next item.
Can anyone think of a better way to deal with this instead of a cursor? Sadly, correcting the source data is not really an option.

Window Functions
Here you have to use row_number() instead of dense_rank() or rank(), partitioned by Item so that the numbering restarts for each item:
select ID, Item, Price, row_number() over (partition by Item order by Price, ID) as RowNum
from Table1

You are looking for both row_number() and a partition by clause:
select id, item, price, row_number() over (partition by item order by price, id) as rank
from table t;

Maybe you can try the query below:
SELECT ID, ITEM, PRICE,
ROW_NUMBER() OVER
(
PARTITION BY ITEM
ORDER BY PRICE, ID
) AS RANK
FROM MY_TABLE;
This will give you the output:
ID | Item | Price |Rank
1 | A | 1.00 | 1
2 | A | 2.00 | 2
3 | A | 2.00 | 3
4 | A | 4.00 | 4
5 | B | 1.23 | 1
6 | B | 2.24 | 2
7 | B | 4.40 | 3
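To check the query above without touching a real database, here is a minimal sketch that runs it against the question's sample data in an in-memory SQLite database from Python. The table name `items` is made up for the illustration, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical table name; the rows are the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "A", 1.00), (2, "A", 2.00), (3, "A", 2.00), (4, "A", 4.00),
     (5, "B", 1.23), (6, "B", 2.24), (7, "B", 4.40)],
)

# PARTITION BY restarts the counter for each item; the second ORDER BY key (id)
# breaks the price tie between ids 2 and 3 in a repeatable way.
ranked = conn.execute(
    """
    SELECT id, item, price,
           ROW_NUMBER() OVER (PARTITION BY item ORDER BY price, id) AS rn
    FROM items
    ORDER BY id
    """
).fetchall()

for row in ranked:
    print(row)
```

Without the id tiebreaker the two 2.00 rows would still get 2 and 3, but which row gets which would be up to the engine.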

Try this, with whichever ORDER BY you want. It shows all three functions side by side:
select id, item, price,
row_number() over (partition by item order by price, id) as RowNumber,
dense_rank() over (partition by item order by price, id) as DenseRank,
rank() over (partition by item order by price, id) as RankValue
from YourTableName;

Related

Select row A if a condition satisfies else select row B for each group

We have 2 tables, bookings and docs
bookings
booking_id | name
100 | "Val1"
101 | "Val5"
102 | "Val6"
docs
doc_id | booking_id | doc_type_id
6 | 100 | 1
7 | 100 | 2
8 | 101 | 1
9 | 101 | 2
10 | 101 | 2
We need the result like this:
booking_id | doc_id
100 | 7
101 | 10
Essentially, we are trying to get the latest record of doc per booking, but if doc_type_id 2 is present, select the latest record of doc type 2 else select latest record of doc_type_id 1.
Is it possible to achieve this with a performance-friendly query? We need to apply it inside a very large query.
You can do it with the FIRST_VALUE() window function by sorting the rows for each booking_id so that the rows with doc_type_id = 2 come first:
SELECT DISTINCT booking_id,
FIRST_VALUE(doc_id) OVER (PARTITION BY booking_id ORDER BY doc_type_id = 2 DESC, doc_id DESC) AS doc_id
FROM docs;
If you want full rows returned, you could use the ROW_NUMBER() window function instead:
SELECT booking_id, doc_id, doc_type_id
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY booking_id ORDER BY doc_type_id = 2 DESC, doc_id DESC) rn
FROM docs
) t
WHERE rn = 1;
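Sorting on the boolean `doc_type_id = 2` works in engines that let you order by a comparison, but not everywhere (SQL Server, for one, rejects it). As a hedged sketch, the same idea can be written with a CASE expression; the snippet below runs it on the question's sample rows in an in-memory SQLite database (3.25+ assumed for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id INTEGER, booking_id INTEGER, doc_type_id INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(6, 100, 1), (7, 100, 2), (8, 101, 1), (9, 101, 2), (10, 101, 2)],
)

# CASE maps type 2 to sort key 0 and everything else to 1, so type-2 rows
# come first; within the same key the highest doc_id (the latest) wins.
latest = conn.execute(
    """
    SELECT booking_id, doc_id
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY booking_id
                   ORDER BY CASE WHEN doc_type_id = 2 THEN 0 ELSE 1 END,
                            doc_id DESC
               ) AS rn
        FROM docs
    ) AS t
    WHERE rn = 1
    ORDER BY booking_id
    """
).fetchall()

print(latest)
```

Booking 102 has no docs, so it does not appear, which matches the expected result in the question.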

Is it possible to do projection in Google Big Query?

I have a query (due to restrictions, it is using Legacy SQL) that produces a column that is the rolling average of last 3 days of sale (excluding today)
SELECT
id, date, sales, AVG(sales) OVER (PARTITION BY id ORDER BY date RANGE BETWEEN 4 PRECEDING AND 1 PRECEDING) AS projected_sale
FROM tableA
tableA
+-------+---------+---------+
| id | date | sales |
+-------+---------+---------+
| 1 | 01-01-17| 5 |
| 1 | 01-02-17| 6 |
| 1 | 01-03-17| 7 |
| 1 | 01-04-17| 10 |
+-------+---------+---------+
The query produces
+-------+---------+---------+--------------+
| id | date | sales |projected_sale|
+-------+---------+---------+--------------+
| 1 | 01-01-17| 5 | . |
| 1 | 01-02-17| 6 | . |
| 1 | 01-03-17| 7 | . |
| 1 | 01-04-17| 10 | 6 |
+-------+---------+---------+--------------+
Since the average excludes the current row, theoretically I can project the sale for 01-05-17 using the sales from 01-02 to 01-04. However, since tableA doesn't actually have an entry with date 01-05-17, my query stops at 01-04-17 as the last row.
Is what I am trying to do possible in BigQuery?
Thank you
First, I think using RANGE is incorrect here; it should be ROWS instead.
Anyway, below is an example for BigQuery Legacy SQL that demonstrates how to achieve the result you need.
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
As you can see, you simply add the missing day (a comma between tables means UNION ALL in Legacy SQL). Of course, you can extend this so that it adds such a missing row for every id.
Nevertheless, I hope this is a good starting point for you.
You can test and play with it using dummy data, as in your question:
#legacySQL
SELECT
id, dt, sales,
AVG(sales) OVER (
PARTITION BY id ORDER BY dt
ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
) AS projected_sale
FROM (
SELECT * FROM
(SELECT 1 id, '01-01-17' dt, 5 sales),
(SELECT 1 id, '01-02-17' dt, 6 sales),
(SELECT 1 id, '01-03-17' dt, 7 sales),
(SELECT 1 id, '01-04-17' dt, 10 sales)
) tableA, (SELECT 1 id, '01-05-17' dt, 0 sales)
with the result:
Row id dt sales projected_sale
1 1 01-01-17 5 null
2 1 01-02-17 6 5.0
3 1 01-03-17 7 5.5
4 1 01-04-17 10 6.0
5 1 01-05-17 0 7.0
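The "add such a missing row for every id" variant can be sketched outside BigQuery as well. The snippet below is only an illustration in SQLite (3.25+ assumed), driven from Python: it uses standard UNION ALL rather than the Legacy SQL comma, and it stores dates as ISO strings (an assumption, so that `DATE(..., '+1 day')` can compute the next day). One placeholder row is generated per id, dated the day after that id's last sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (id INTEGER, dt TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO tableA VALUES (?, ?, ?)",
    [(1, "2017-01-01", 5), (1, "2017-01-02", 6),
     (1, "2017-01-03", 7), (1, "2017-01-04", 10)],
)

projection = conn.execute(
    """
    SELECT id, dt, sales,
           AVG(sales) OVER (
               PARTITION BY id ORDER BY dt
               ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
           ) AS projected_sale
    FROM (
        SELECT id, dt, sales FROM tableA
        UNION ALL
        -- placeholder row: the day after each id's last recorded sale
        SELECT id, DATE(MAX(dt), '+1 day'), 0 FROM tableA GROUP BY id
    )
    ORDER BY id, dt
    """
).fetchall()

for row in projection:
    print(row)
```

The placeholder's own sales value (0) never enters its average, because the frame ends at 1 PRECEDING.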

SQL to get first date and amount per account

I want to get back the date and amount of the first transaction per account in a transaction table. The table (GiftHeader) looks like this:
EntityID Date Amount
1 1/1/2027 00:00:00:00 1.00
1 2/1/2027 00:00:00:00 2.00
2 2/1/2027 00:00:00:00 4.00
2 3/1/2027 00:00:00:00 2.00
In this case, I would expect the following:
EntityID BatchDate Amount
1 1/1/2027 00:00:00:00 1.00
2 2/1/2027 00:00:00:00 4.00
Here's the SQL I'm using which isn't working.
select DISTINCT entityid, min(BatchDate) as FirstGiftDate
from GiftHeader
group by EntityId,BatchDate
order by EntityId
Any help would be appreciated.
Regards,
Joshua Goodwin
You can use top 1 with ties as below
Select top 1 with ties * from GiftHeader
order by row_number() over (partition by entityid order by [BatchDate])
The other, more traditional approach is:
Select * from (
Select *, RowN = row_number() over (partition by entityid order by BatchDate) from GiftHeader ) a
Where a.RowN = 1
Output:
+----------+-------------------------+--------+
| EntityId | BatchDate | Amount |
+----------+-------------------------+--------+
| 1 | 2027-01-01 00:00:00.000 | 1 |
| 2 | 2027-02-01 00:00:00.000 | 4 |
+----------+-------------------------+--------+
You can use ROW_NUMBER as follows:
SELECT EntityID,
Date,
Amount
FROM (SELECT ROW_NUMBER()
OVER (
PARTITION BY EntityID
ORDER BY Date) AS RN,
*
FROM GiftHeader) a
WHERE a.RN = 1
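For anyone wondering why the query in the question returns too many rows: with BatchDate in the GROUP BY, every (entity, date) pair becomes its own group, so MIN has nothing to collapse. A rough sketch in SQLite (3.25+ assumed, ISO date strings assumed for the sample data) puts the two side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GiftHeader (EntityId INTEGER, BatchDate TEXT, Amount REAL)"
)
conn.executemany(
    "INSERT INTO GiftHeader VALUES (?, ?, ?)",
    [(1, "2027-01-01", 1.00), (1, "2027-02-01", 2.00),
     (2, "2027-02-01", 4.00), (2, "2027-03-01", 2.00)],
)

# The question's query: grouping by BatchDate as well keeps one row per date.
too_many = conn.execute(
    """
    SELECT DISTINCT EntityId, MIN(BatchDate) AS FirstGiftDate
    FROM GiftHeader
    GROUP BY EntityId, BatchDate
    ORDER BY EntityId
    """
).fetchall()

# ROW_NUMBER per entity, earliest date first, then keep only row 1.
first_gifts = conn.execute(
    """
    SELECT EntityId, BatchDate, Amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY EntityId ORDER BY BatchDate) AS rn
        FROM GiftHeader
    ) AS a
    WHERE rn = 1
    ORDER BY EntityId
    """
).fetchall()

print(len(too_many), first_gifts)
```

The ROW_NUMBER form also carries Amount along for free, which a plain GROUP BY EntityId with MIN(BatchDate) would not.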

Rank function for date in Oracle SQL

I have the following code for example:
SELECT id, order_day, purchase_id FROM d
customer_id and purchase_id are unique. Each customer_id can have multiple purchase_ids. Assume everyone has made at least 5 orders.
Now, I just want to pull the first 5 purchase IDs for each customer ID (based on the earliest purchase dates). I want the result to look like this:
id | purchase_id | rank
-------------------------
A | WERFEW43 | 1
A | ERTGDSFV | 3
A | FDGRT45 | 2
A | BRTE4TEW | 4
A | DFGDV | 5
B | DSFSF | 1
B | CF345 | 2
B | SDFSDFSDFS | 4
I thought of Ranking order_day, but my knowledge is not good enough to pull this off.
select id, purchase_id, rank() over (partition by id order by order_day) as rnk
from d
You can also try dense_rank() over (partition by id order by order_day) and row_number() over (partition by id order by order_day) and choose whichever suits you best.
select *
from
( SELECT
id
,order_day
,purchase_id
,row_number() -- ranking
over (partition by id -- each customer
order by order_day) as rn -- based on oldest dates
FROM d
) dt -- Oracle does not accept AS before a table alias
where rn <= 5
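The same top-N-per-group pattern can be tried on a toy data set. This is only a sketch in SQLite (3.25+ assumed) rather than Oracle, with made-up purchase ids and dates, and with N passed in as a parameter so a small value can be used for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (id TEXT, order_day TEXT, purchase_id TEXT)")
# Invented sample rows: customer A has three purchases, customer B has two.
conn.executemany(
    "INSERT INTO d VALUES (?, ?, ?)",
    [("A", "2020-01-03", "p1"), ("A", "2020-01-01", "p2"),
     ("A", "2020-01-02", "p3"),
     ("B", "2020-02-01", "q1"), ("B", "2020-01-15", "q2")],
)


def first_n_purchases(connection, n):
    """Return the n earliest purchases per customer as (id, purchase_id, rn)."""
    return connection.execute(
        """
        SELECT id, purchase_id, rn
        FROM (
            SELECT id, purchase_id,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_day) AS rn
            FROM d
        )
        WHERE rn <= ?
        ORDER BY id, rn
        """,
        (n,),
    ).fetchall()


print(first_n_purchases(conn, 2))
```

If two purchases can share an order_day, adding purchase_id as a second ORDER BY key keeps the result repeatable.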

SQL RANK() versus ROW_NUMBER()

I'm confused about the differences between these. Running the following SQL gets me two identical result sets. Can someone please explain the differences?
SELECT ID, [Description], RANK() OVER(PARTITION BY StyleID ORDER BY ID) as 'Rank' FROM SubStyle
SELECT ID, [Description], ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) as 'RowNumber' FROM SubStyle
You will only see the difference if you have ties within a partition for a particular ordering value.
RANK and DENSE_RANK are deterministic in this case: all rows with the same value for both the ordering and partitioning columns end up with an equal result, whereas ROW_NUMBER arbitrarily (non-deterministically) assigns an incrementing result to the tied rows.
Example: (All rows have the same StyleID so are in the same partition and within that partition the first 3 rows are tied when ordered by ID)
WITH T(StyleID, ID)
AS (SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,1 UNION ALL
SELECT 1,2)
SELECT *,
RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [RANK],
ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) AS [ROW_NUMBER],
DENSE_RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS [DENSE_RANK]
FROM T
Returns
StyleID ID RANK ROW_NUMBER DENSE_RANK
----------- -------- --------- --------------- ----------
1 1 1 1 1
1 1 1 2 1
1 1 1 3 1
1 2 4 4 2
You can see that for the three identical rows the ROW_NUMBER increments, the RANK value remains the same then it leaps to 4. DENSE_RANK also assigns the same rank to all three rows but then the next distinct value is assigned a value of 2.
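If you don't have SQL Server at hand, the same experiment can be reproduced elsewhere. Here is a small sketch in SQLite (3.25+ assumed) run from Python; the column aliases are shortened to avoid clashing with function names, and a named window replaces the three repeated OVER clauses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same shape as the example above: one partition, three rows tied on id = 1.
rows = conn.execute(
    """
    WITH t(style_id, id) AS (VALUES (1, 1), (1, 1), (1, 1), (1, 2))
    SELECT id,
           RANK()       OVER w AS rnk,
           ROW_NUMBER() OVER w AS row_num,
           DENSE_RANK() OVER w AS dense_rnk
    FROM t
    WINDOW w AS (PARTITION BY style_id ORDER BY id)
    ORDER BY row_num
    """
).fetchall()

for row in rows:
    print(row)
```

Because the three tied rows are identical here, it does not matter which of them receives row number 1, 2 or 3; with distinguishable rows that assignment would be arbitrary.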
ROW_NUMBER: Returns a unique number for each row, starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
RANK: Assigns a number to each row starting with 1; rows that have duplicate values receive the same rank, and a gap appears in the sequence after each group of duplicates.
This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
The reason for this is that logically, window functions are calculated before DISTINCT is applied.
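A quick way to see this evaluation order in action is to count the rows each variant returns. The following is a sketch in SQLite (3.25+ assumed) with the same eight values as in the tables above; the alias `n` is chosen only for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("a",), ("a",), ("a",), ("b",), ("c",), ("c",), ("d",), ("e",)],
)

# ROW_NUMBER makes every (v, n) pair unique before DISTINCT runs,
# so DISTINCT has nothing to remove.
with_row_number = conn.execute(
    "SELECT DISTINCT v, ROW_NUMBER() OVER (ORDER BY v) AS n FROM t ORDER BY v, n"
).fetchall()

# DENSE_RANK gives equal values the same number, so duplicates collapse.
with_dense_rank = conn.execute(
    "SELECT DISTINCT v, DENSE_RANK() OVER (ORDER BY v) AS n FROM t ORDER BY v, n"
).fetchall()

print(len(with_row_number), with_dense_rank)
```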
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (w) row_number,
RANK() OVER (w) rank,
DENSE_RANK() OVER (w) dense_rank
FROM t
WINDOW w AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
Simple query without partition clause:
select
sal,
RANK() over(order by sal desc) as Rank,
DENSE_RANK() over(order by sal desc) as DenseRank,
ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------
Quite a bit:
The rank of a row is one plus the number of ranks that come before the row in question.
Row_number is the distinct rank of rows, without any gap in the ranking.
http://www.bidn.com/blogs/marcoadf/bidn-blog/379/ranking-functions-row_number-vs-rank-vs-dense_rank-vs-ntile
Note, all these windowing functions return an integer-like value.
Often the database will choose a BIGINT datatype, and this takes much more space than we need. We will rarely need a range from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807.
Cast the results as a BYTEINT, SMALLINT, or INTEGER.
Modern systems and hardware are so powerful that you may never see a meaningful extra use of resources, but I think it's best practice.
Look at this example.
CREATE TABLE [dbo].#TestTable(
[id] [int] NOT NULL,
[create_date] [date] NOT NULL,
[info1] [varchar](50) NOT NULL,
[info2] [varchar](50) NOT NULL,
)
Insert some data
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/1/09', 'Blue', 'Green')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/2/09', 'Red', 'Yellow')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (1, '1/3/09', 'Orange', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/1/09', 'Yellow', 'Blue')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (2, '1/5/09', 'Blue', 'Orange')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/2/09', 'Green', 'Purple')
INSERT INTO dbo.#TestTable (id, create_date, info1, info2)
VALUES (3, '1/8/09', 'Red', 'Blue')
Repeat the same values for id 1
INSERT INTO dbo.#TestTable (id, create_date, info1, info2) VALUES (1,
'1/1/09', 'Blue', 'Green')
Look at all the rows
SELECT * FROM #TestTable
Look at your results
SELECT Id,
create_date,
info1,
info2,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY create_date DESC) AS RowId,
RANK() OVER(PARTITION BY Id ORDER BY create_date DESC) AS [RANK]
FROM #TestTable
Compare the RowId and RANK columns for the duplicated id 1 rows to see the difference.
I haven't done anything with rank, but I discovered this today with row_number().
select item, name, sold, row_number() over(partition by item order by sold) as row from table_name
This will result in some repeating row numbers since in my case each name holds all items. Each item will be ordered by how many were sold.
+--------+------+-----+----+
|item    |name  |sold |row |
+--------+------+-----+----+
|glasses |store1| 30  | 1  |
|glasses |store2| 35  | 2  |
|glasses |store3| 40  | 3  |
|shoes   |store2| 10  | 1  |
|shoes   |store1| 20  | 2  |
|shoes   |store3| 22  | 3  |
+--------+------+-----+----+
Also, pay attention to the ORDER BY inside OVER when using RANK (the standard AdventureWorks database is used for this example).
SELECT as1.SalesOrderID, as1.SalesOrderDetailID,
RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) rank_same_as_partition,
RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderDetailID) rank_salesorderdetailid
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderID = 43659
ORDER BY SalesOrderDetailID;
Gives result:
SalesOrderID SalesOrderDetailID rank_same_as_partition rank_salesorderdetailid
43659 1 1 1
43659 2 1 2
43659 3 1 3
43659 4 1 4
43659 5 1 5
43659 6 1 6
43659 7 1 7
43659 8 1 8
43659 9 1 9
43659 10 1 10
43659 11 1 11
43659 12 1 12
But if we change the ORDER BY to use OrderQty:
SELECT as1.SalesOrderID, as1.OrderQty,
RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.SalesOrderID) rank_salesorderid,
RANK() OVER (PARTITION BY as1.SalesOrderID ORDER BY as1.OrderQty) rank_orderqty
FROM Sales.SalesOrderDetail as1
WHERE SalesOrderID = 43659
ORDER BY OrderQty;
Gives:
SalesOrderID OrderQty rank_salesorderid rank_orderqty
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 1 1 1
43659 2 1 7
43659 2 1 7
43659 3 1 9
43659 3 1 9
43659 4 1 11
43659 6 1 12
Notice how the Rank changes when we use OrderQty (rightmost column second table) in ORDER BY and how it changes when we use SalesOrderDetailID (rightmost column first table) in ORDER BY.
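The effect does not depend on AdventureWorks. As a rough sketch with a tiny invented table (SQLite 3.25+ assumed, table and column names made up), ordering by the partition column itself makes every row in the partition a tie, so RANK stays at 1, while ordering by a column that varies produces ranks with gaps after ties:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (order_id INTEGER, qty INTEGER)")
# Invented rows: one order with quantities 1, 1, 2, 3.
conn.executemany(
    "INSERT INTO detail VALUES (?, ?)",
    [(43659, 1), (43659, 1), (43659, 2), (43659, 3)],
)

ranks = conn.execute(
    """
    SELECT qty,
           RANK() OVER (PARTITION BY order_id ORDER BY order_id) AS rank_same_as_partition,
           RANK() OVER (PARTITION BY order_id ORDER BY qty)      AS rank_qty
    FROM detail
    ORDER BY qty
    """
).fetchall()

for row in ranks:
    print(row)
```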