Get difference from top ranked item - SQL

I'm using the following query to rank items, but I also need it to show the difference between each item and the top-ranked item (the top-ranked item should show the difference between it and the 2nd-ranked item), like so:
SELECT rdate, rtime, SF, DENSE_RANK() OVER (PARTITION BY rdate, rtime ORDER BY rdate,
rtime, SF DESC) rank
FROM DailySF
rdate      | rtime | SF | rank | DiffTop
-----------|-------|----|------|--------
18/02/2021 | 09:00 | 54 |  1   |   2
18/02/2021 | 09:00 | 52 |  2   |  -2
18/02/2021 | 09:00 | 50 |  3   |  -4
19/02/2021 | 09:00 | 53 |  1   |  10
19/02/2021 | 09:00 | 43 |  2   | -10
19/02/2021 | 09:00 | 40 |  3   | -13
19/02/2021 | 09:00 | 35 |  4   | -18
How do I create the DiffTop column?

You can use window functions for this as well:
SELECT rdate, rtime, SF,
DENSE_RANK() OVER (PARTITION BY rdate, rtime ORDER BY rdate,
rtime, SF DESC) as rank,
(SF - MAX(SF) OVER (PARTITION BY rdate, rtime)) as diff
FROM DailySF;
The top ranked value is the one with maximum SF.
To handle the top ranked item:
SELECT rdate, rtime, SF,
DENSE_RANK() OVER (PARTITION BY rdate, rtime ORDER BY rdate,
rtime, SF DESC) as rank,
(CASE WHEN SF = MAX(SF) OVER (PARTITION BY rdate, rtime)
THEN SF - LEAD(SF) OVER (PARTITION BY rdate, rtime ORDER BY SF DESC)
ELSE SF - MAX(SF) OVER (PARTITION BY rdate, rtime)
END) as diff
FROM DailySF;
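To sanity-check the pattern, here is a minimal runnable sketch using Python's built-in sqlite3 (SQLite 3.25+ has these window functions); the table name and sample rows follow the question, with ISO date strings swapped in for convenience:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DailySF (rdate TEXT, rtime TEXT, SF INT)")
rows = [("2021-02-18", "09:00", 54), ("2021-02-18", "09:00", 52),
        ("2021-02-18", "09:00", 50), ("2021-02-19", "09:00", 53),
        ("2021-02-19", "09:00", 43)]
con.executemany("INSERT INTO DailySF VALUES (?, ?, ?)", rows)

# Top-ranked row diffs against the 2nd-ranked (LEAD over SF DESC);
# everyone else diffs against the partition maximum.
sql = """
SELECT rdate, rtime, SF,
       DENSE_RANK() OVER (PARTITION BY rdate, rtime ORDER BY SF DESC) AS rnk,
       CASE WHEN SF = MAX(SF) OVER (PARTITION BY rdate, rtime)
            THEN SF - LEAD(SF) OVER (PARTITION BY rdate, rtime ORDER BY SF DESC)
            ELSE SF - MAX(SF) OVER (PARTITION BY rdate, rtime)
       END AS DiffTop
FROM DailySF
ORDER BY rdate, SF DESC
"""
result = con.execute(sql).fetchall()
```

With the sample rows above, the DiffTop column comes out as 2, -2, -4 for the first slot and 10, -10 for the second.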

Related

Need to get maximum date range which is overlapping in SQL

I have a table with 3 columns id, start_date, end_date
Some of the values are as follows:
1 2018-01-01 2030-01-01
1 2017-10-01 2018-10-01
1 2019-01-01 2020-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2010-10-01 2010-12-01
2 2008-01-01 2009-01-01
I have the above kind of data set, where for each id I have to filter out overlapping date ranges, keeping the maximum date range, while also keeping the other date ranges that do not overlap.
Hence desired output should be:
1 2018-01-01 2030-01-01
1 2015-01-01 2016-01-01
2 2010-01-01 2011-02-01
2 2008-01-01 2009-01-01
I am unable to find the right way to code this in Impala. Can someone please help me?
I have tried:
with cte as (
  select a.*,
         row_number() over (partition by id order by datediff(end_date, start_date) desc) as flag
  from mytable a
)
select * from cte where flag = 1
but this will remove the other date ranges which are not overlapping. Please help.
Use row_number() together with a countItem for each id:
with cte as(
select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable
)
select id,start_date,end_date
from cte
where seq = 1 or seq = countItem
or without cte
select id,start_date,end_date
from
(select *,
row_number() over(partition by id order by id) as seq,
count(*) over(partition by id order by id) as countItem
from mytable) t
where seq = 1 or seq = countItem
demo in db<>fiddle
You can use a cumulative max to see if there is any overlap with preceding rows. If there is not, then you have the first row of a new group (row in the result set).
A cumulative sum of the starts assigns each row in the source to a group. Then aggregate:
select id, min(start_date), max(end_date)
from (select t.*,
sum(case when prev_end_date >= start_date then 0 else 1 end) over
(partition by id
order by start_date
rows between unbounded preceding and current row
) as grp
from (select t.*,
max(end_date) over (partition by id
order by start_date
rows between unbounded preceding and 1 preceding
) as prev_end_date
from t
) t
) t
group by id, grp;
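The cumulative-max query above can be run as-is in SQLite; here is a hedged sketch via Python's sqlite3, with the question's sample rows loaded into a table named t (as in the answer). Note this approach merges each set of overlapping ranges into one island, so the merged row spans the whole island:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INT, start_date TEXT, end_date TEXT)")
rows = [(1, "2018-01-01", "2030-01-01"), (1, "2017-10-01", "2018-10-01"),
        (1, "2019-01-01", "2020-01-01"), (1, "2015-01-01", "2016-01-01"),
        (2, "2010-01-01", "2011-02-01"), (2, "2010-10-01", "2010-12-01"),
        (2, "2008-01-01", "2009-01-01")]
con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# prev_end_date = max end of all earlier rows; a row starts a new island
# (grp increments) only when it starts after everything seen so far ended.
sql = """
select id, min(start_date), max(end_date)
from (select t.*,
             sum(case when prev_end_date >= start_date then 0 else 1 end) over
                 (partition by id order by start_date
                  rows between unbounded preceding and current row) as grp
      from (select t.*,
                   max(end_date) over (partition by id order by start_date
                                       rows between unbounded preceding and 1 preceding) as prev_end_date
            from t) t
     ) t
group by id, grp
order by id, 2
"""
islands = con.execute(sql).fetchall()
```

For id 1, the three overlapping ranges collapse into 2017-10-01..2030-01-01 while 2015-01-01..2016-01-01 survives on its own.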

Find the longest streak of perfect scores per player

I have the following result from a SELECT query with ORDER BY player_id ASC, time ASC in a PostgreSQL database:
player_id points time
395 0 2018-06-01 17:55:23.982413-04
395 100 2018-06-30 11:05:21.8679-04
395 0 2018-07-15 21:56:25.420837-04
395 100 2018-07-28 19:47:13.84652-04
395 0 2018-11-27 17:09:59.384-05
395 100 2018-12-02 08:56:06.83033-05
399 0 2018-05-15 15:28:22.782945-04
399 100 2018-06-10 12:11:18.041521-04
454 0 2018-07-10 18:53:24.236363-04
675 0 2018-08-07 20:59:15.510936-04
696 0 2018-08-07 19:09:07.126876-04
756 100 2018-08-15 08:21:11.300871-04
756 100 2018-08-15 16:43:08.698862-04
756 0 2018-08-15 17:22:49.755721-04
756 100 2018-10-07 15:30:49.27374-04
756 0 2018-10-07 15:35:00.975252-04
756 0 2018-11-27 19:04:06.456982-05
756 100 2018-12-02 19:24:20.880022-05
756 100 2018-12-04 19:57:48.961111-05
I'm trying to find each player's longest streak where points = 100, with the tiebreaker being whichever streak began most recently. I also need to determine the time at which that player's longest streak began. The expected result would be:
player_id longest_streak time_began
395 1 2018-12-02 08:56:06.83033-05
399 1 2018-06-10 12:11:18.041521-04
756 2 2018-12-02 19:24:20.880022-05
A gaps-and-islands problem indeed.
Assuming:
"Streaks" are not interrupted by rows from other players.
All columns are defined NOT NULL. (Else you have to do more.)
This should be simplest and fastest as it only needs two fast row_number() window functions:
SELECT DISTINCT ON (player_id)
player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
SELECT player_id, points, ts
, row_number() OVER (PARTITION BY player_id ORDER BY ts)
- row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp -- omit "points" after WHERE points = 100
ORDER BY player_id, seq_len DESC, time_began DESC;
db<>fiddle here
I'm using the column name ts instead of time, which is a reserved word in standard SQL. It's allowed in Postgres, but with limitations, and it's still a bad idea to use it as an identifier.
The "trick" is to subtract row numbers so that consecutive rows fall in the same group (grp) per (player_id, points). Then filter the ones with 100 points, aggregate per group and return only the longest, most recent result per player.
Basic explanation for the technique:
Select longest continuous sequence
We can use GROUP BY and DISTINCT ON in the same SELECT, GROUP BY is applied before DISTINCT ON. Consider the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
About DISTINCT ON:
Select first row in each GROUP BY group?
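The row-number-difference query above runs unchanged in SQLite, which makes for an easy check; SQLite has no DISTINCT ON, so this sketch picks each player's first row (by the same ORDER BY) in Python instead. The tbl data is a simplified, hypothetical version of the question's, with small integer timestamps:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (player_id INT, points INT, ts INT)")
rows = [(395, 0, 1), (395, 100, 2), (395, 0, 3), (395, 100, 4), (395, 0, 5), (395, 100, 6),
        (756, 100, 1), (756, 100, 2), (756, 0, 3), (756, 100, 4), (756, 0, 5),
        (756, 0, 6), (756, 100, 7), (756, 100, 8)]
con.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)

# Consecutive rows with the same points value get the same grp
# (overall row number minus per-points row number).
sql = """
SELECT player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
   SELECT player_id, points, ts,
          row_number() OVER (PARTITION BY player_id ORDER BY ts)
        - row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
   FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp
ORDER BY player_id, seq_len DESC, time_began DESC
"""
best = {}
for player_id, seq_len, time_began in con.execute(sql):
    # first row per player wins, mimicking DISTINCT ON (player_id)
    best.setdefault(player_id, (seq_len, time_began))
```

Player 756's longest streak is the two 100s at ts 7-8 (most recent of the two length-2 streaks), and 395's is a single 100, the one beginning most recently.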
This is a gaps-and-islands problem; you can use a conditional SUM window function to compute a gap number,
then use the MAX and COUNT window functions again.
Query 1:
WITH CTE AS (
  SELECT *,
         SUM(CASE WHEN points = 100 THEN 1 END) OVER (PARTITION BY player_id ORDER BY time) -
         SUM(1) OVER (PARTITION BY player_id ORDER BY time) RN
  FROM T
)
SELECT player_id,
       MAX(longest_streak) streak_end_time,
       MAX(cnt) longest_streak
FROM (
SELECT player_id,
MAX(time) OVER(PARTITION BY rn,player_id) longest_streak,
COUNT(*) OVER(PARTITION BY rn,player_id) cnt
FROM CTE
WHERE points > 0
) t1
GROUP BY player_id
Results:
| player_id | streak_end_time             | longest_streak |
|-----------|-----------------------------|----------------|
| 756       | 2018-12-04T19:57:48.961111Z | 2              |
| 399       | 2018-06-10T12:11:18.041521Z | 1              |
| 395       | 2018-12-02T08:56:06.83033Z  | 1              |
One way to do this is to look at how many rows fall between the previous and next non-100 results. To get the lengths of the streaks:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
)
select s.*,
coalesce(next_seqnum, cnt + 1) - coalesce(prev_seqnum, 0) - 1 as length
from (select s.*,
max(seqnum) filter (where score <> 100) over (partition by player_id order by time) as prev_seqnum,
min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc) as next_seqnum
from s
) s
where score = 100;
You can then incorporate the other conditions:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
),
streaks as (
select s.*,
next_seqnum - prev_seqnum - 1 as length,
max(next_seqnum - prev_seqnum - 1) over (partition by player_id) as max_length,
max(next_seqnum) over (partition by player_id) as max_next_seqnum
from (select s.*,
coalesce(max(seqnum) filter (where score <> 100) over (partition by player_id order by time), 0) as prev_seqnum,
coalesce(min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc), cnt + 1) as next_seqnum
from s
) s
where score = 100
)
select s.*
from streaks s
where length = max_length and
next_seqnum = max_next_seqnum;
Here is my answer:
select user_id, max(strk) longest_streak, max(time) time
from (
  select
    user_id,
    non_streak,
    streak,
    ifnull(non_streak, streak) strk,
    max(time) time
  from (
    select
      user_id, time, points,
      lag(points) over (partition by user_id order by time) prev_point,
      case when points + lag(points) over (partition by user_id order by time) = 100 then 1 end as non_streak,
      case when points + lag(points) over (partition by user_id order by time) > 100 then 1 end as streak
    from players
  ) t
  where ifnull(non_streak, streak) is not null
  group by 1, 2, 3
) s
group by user_id

query to keep partitions separate when physically separated

I have a table that contains order/shipment history. A basic dummy version is:
ORDERS
order_no | order_stat | stat_date
2 | Planned | 01-Jan-2000
2 | Picked | 15-Jan-2000
2 | Planned | 17-Jan-2000
2 | Planned | 05-Feb-2000
2 | Planned | 31-Mar-2000
2 | Picked | 05-Apr-2000
2 | Shipped | 10-Apr-2000
I need to figure out how long each order has been in each order status/phase. The only problem is when I create a partition on the order_no and order_stat, I get results that make sense but are not what I am looking for.
My sql:
select
order_no
,order_stat
,stat_date
,lag(stat_date, 1) over (partition by order_no order by stat_date) prev_stat_date
,stat_date - lag(stat_date, 1) over (partition by order_no order by stat_date) date_diff
,row_number() over(partition by order_no, order_stat order by stat_date) rnk
from
orders
Will give me the following results:
order_no | order_stat | stat_date | prev_stat_date | rnk
2 | Planned | 01-Jan-2000 | | 1
2 | Picked | 15-Jan-2000 | 01-Jan-2000 | 1
2 | Planned | 17-Jan-2000 | 15-Jan-2000 | 2
2 | Planned | 05-Feb-2000 | 17-Jan-2000 | 3
2 | Planned | 31-Mar-2000 | 05-Feb-2000 | 4
2 | Picked | 05-Apr-2000 | 31-Mar-2000 | 2
2 | Shipped | 10-Apr-2000 | 05-Apr-2000 | 1
I would like to have results that look like this (the rnk starts over when it reverts back to a previous order stat):
order_no | order_stat | stat_date | prev_stat_date | rnk
2 | Planned | 01-Jan-2000 | | 1
2 | Picked | 15-Jan-2000 | 01-Jan-2000 | 1
2 | Planned | 17-Jan-2000 | 15-Jan-2000 | 1
2 | Planned | 05-Feb-2000 | 17-Jan-2000 | 2
2 | Planned | 31-Mar-2000 | 05-Feb-2000 | 3
2 | Picked | 05-Apr-2000 | 31-Mar-2000 | 1
2 | Shipped | 10-Apr-2000 | 05-Apr-2000 | 1
I'm trying to get a running count of how long an order has been in its current status, one that starts over whenever the status changes, even if the new status has occurred earlier, instead of being lumped into the earlier partition. I have no idea how to approach this; any and all insight would be greatly appreciated.
If I understand correctly, this is a gaps-and-islands problem.
The difference of row numbers can be used to identify the "island"s and then to enumerate the values:
select t.*,
row_number() over (partition by order_no, order_stat, seqnum - seqnum_2 order by stat_date) as your_rank
from (select o.*,
row_number() over (partition by order_no order by stat_date) as seqnum,
row_number() over (partition by order_no, order_stat order by stat_date) as seqnum_2
from orders o
) t;
I've left out the other columns (like the lag()) so you can see the logic. It can be a bit hard to follow why this works. If you stare at some rows from the subquery, you will probably see how the difference of the row numbers defines the groups you want.
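As an illustration, here is a runnable sketch of that query via Python's sqlite3, loaded with the question's sample rows (dates converted to ISO strings so they sort correctly as text):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_no INT, order_stat TEXT, stat_date TEXT)")
rows = [(2, "Planned", "2000-01-01"), (2, "Picked", "2000-01-15"),
        (2, "Planned", "2000-01-17"), (2, "Planned", "2000-02-05"),
        (2, "Planned", "2000-03-31"), (2, "Picked", "2000-04-05"),
        (2, "Shipped", "2000-04-10")]
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# seqnum - seqnum_2 is constant within each unbroken run of one status,
# so ranking inside (order_no, order_stat, seqnum - seqnum_2) restarts
# whenever the status changes.
sql = """
select t.order_stat, t.stat_date,
       row_number() over (partition by order_no, order_stat, seqnum - seqnum_2
                          order by stat_date) as rnk
from (select o.*,
             row_number() over (partition by order_no order by stat_date) as seqnum,
             row_number() over (partition by order_no, order_stat order by stat_date) as seqnum_2
      from orders o
     ) t
order by t.stat_date
"""
ranks = [r[2] for r in con.execute(sql)]
```

The ranks come back as 1, 1, 1, 2, 3, 1, 1 in date order, matching the desired output in the question.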
Continuing #Gordon's Tabibitosan approach, once you have the groupings you can get both the order within each group and the elapsed number of days for each member of the group:
-- CTE for sample data
with orders (order_no, order_stat, stat_date) as (
select 2, 'Planned', date '2000-01-01' from dual
union all select 2, 'Picked', date '2000-01-15' from dual
union all select 2, 'Planned', date '2000-01-17' from dual
union all select 2, 'Planned', date '2000-02-05' from dual
union all select 2, 'Planned', date '2000-03-31' from dual
union all select 2, 'Picked ', date '2000-04-05' from dual
union all select 2, 'Shipped', date '2000-04-10' from dual
)
-- actual query
select order_no, order_stat, stat_date, grp,
dense_rank() over (partition by order_no, order_stat, grp order by stat_date) as rnk,
stat_date - min(stat_date) keep (dense_rank first order by stat_date)
over (partition by order_no, order_stat, grp) as stat_days
from (
select order_no, order_stat, stat_date,
row_number() over (partition by order_no order by stat_date)
- row_number() over (partition by order_no, order_stat order by stat_date) as grp
from orders
)
order by order_no, stat_date;
ORDER_NO ORDER_S STAT_DATE GRP RNK STAT_DAYS
---------- ------- ---------- ---------- ---------- ----------
2 Planned 2000-01-01 0 1 0
2 Picked 2000-01-15 1 1 0
2 Planned 2000-01-17 1 1 0
2 Planned 2000-02-05 1 2 19
2 Planned 2000-03-31 1 3 74
2 Picked 2000-04-05 5 1 0
2 Shipped 2000-04-10 6 1 0
The inline view is essentially what Gordon did, except it trivially does the subtraction at that level. The outer query then gets the rank the same way, but also uses an analytic function to get the earliest date for that group, and subtracts it from the current row's date. You don't have to include grp or rnk in your final result of course, they're shown to give more insight into what's happening.
It isn't clear exactly what you want, but you can expand even further to, for instance:
with cte1 (order_no, order_stat, stat_date, grp) as (
select order_no, order_stat, stat_date,
row_number() over (partition by order_no order by stat_date)
- row_number() over (partition by order_no, order_stat order by stat_date)
from orders
),
cte2 (order_no, order_stat, stat_date, grp, grp_date, rnk) as (
select order_no, order_stat, stat_date, grp,
min(stat_date) keep (dense_rank first order by stat_date)
over (partition by order_no, order_stat, grp),
dense_rank() over (partition by order_no, order_stat, grp order by stat_date)
from cte1
)
select order_no, order_stat, stat_date, grp, grp_date, rnk,
stat_date - grp_date as stat_days_so_far,
case
when order_stat != 'Shipped' then
coalesce(first_value(stat_date)
over (partition by order_no order by grp_date
range between 1 following and unbounded following), trunc(sysdate))
- min(stat_date) keep (dense_rank first order by stat_date)
over (partition by order_no, order_stat, grp)
end as stat_days_total,
stat_date - min(stat_date) over (partition by order_no) as order_days_so_far,
case
when max(order_stat) keep (dense_rank last order by stat_date)
over (partition by order_no) = 'Shipped' then
max(stat_date) over (partition by order_no)
else
trunc(sysdate)
end
- min(stat_date) over (partition by order_no) as order_days_total
from cte2
order by order_no, stat_date;
which for your sample data gives:
ORDER_NO ORDER_S STAT_DATE GRP GRP_DATE RNK STAT_DAYS_SO_FAR STAT_DAYS_TOTAL ORDER_DAYS_SO_FAR ORDER_DAYS_TOTAL
---------- ------- ---------- ---------- ---------- ---------- ---------------- --------------- ----------------- ----------------
2 Planned 2000-01-01 0 2000-01-01 1 0 14 0 100
2 Picked 2000-01-15 1 2000-01-15 1 0 2 14 100
2 Planned 2000-01-17 1 2000-01-17 1 0 79 16 100
2 Planned 2000-02-05 1 2000-01-17 2 19 79 35 100
2 Planned 2000-03-31 1 2000-01-17 3 74 79 90 100
2 Picked 2000-04-05 5 2000-04-05 1 0 5 95 100
2 Shipped 2000-04-10 6 2000-04-10 1 0 100 100
I've included some logic to assume that 'Shipped' is the final status, and if that hasn't been reached then the last status is still running - so counting up to today. That might be wrong, and you might have other end-status values (e.g. cancelled). Anyway, a few things for you to explore and play with...
You might be able to do something similar with match_recognize, but I'll leave that to someone else.

How can I group this in SQL

I have this selection
Trip Sequence Shipment Place
=================================
102 10 4798 Amsterdam
102 20 4823 Utrecht
102 30 4831 Utrecht
102 40 4830 Rotterdam
102 50 4790 Rotterdam
102 60 4840 Utrecht
102 70 4810 Amsterdam
I want this grouped like this:
Trip Group Place
==========================
102 1 Amsterdam
102 2 Utrecht
102 3 Rotterdam
102 4 Utrecht
102 5 Amsterdam
How can I achieve this in SQL Server?
Answer of #Giorgos Betos is great. Follow-up question: How can I assign these groupnumbers to the rows of the original table?
Try this:
SELECT Trip, Place,
ROW_NUMBER() OVER (ORDER BY MinSequence) AS [Group]
FROM (
SELECT Trip, Place, MIN(Sequence) AS MinSequence
FROM (
SELECT Trip, Place, Sequence,
ROW_NUMBER() OVER (ORDER BY Sequence) -
ROW_NUMBER() OVER (PARTITION BY Trip, Place
ORDER BY Sequence) AS grp
FROM mytable) AS x
GROUP BY Trip, Place, grp) AS t
Demo here
Edit:
To get the ranking numbers in the original, ungrouped, table you can use DENSE_RANK:
SELECT Trip, Place, Sequence,
DENSE_RANK() OVER (ORDER BY grp) AS [Group]
FROM (
SELECT Trip, Place, Sequence,
MIN(Sequence) OVER (PARTITION BY Trip, Place, grp) AS grp
FROM (
SELECT Trip, Place, Sequence,
ROW_NUMBER() OVER (ORDER BY Sequence) -
ROW_NUMBER() OVER (PARTITION BY Trip, Place
ORDER BY Sequence) AS grp
FROM mytable) AS t) AS x
ORDER BY Sequence
Demo here
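The same DENSE_RANK approach can be exercised in SQLite through Python's sqlite3 (the [Group] bracket syntax is SQL Server's, so the alias is renamed grp_no here); the mytable data is copied from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (Trip INT, Sequence INT, Shipment INT, Place TEXT)")
rows = [(102, 10, 4798, "Amsterdam"), (102, 20, 4823, "Utrecht"),
        (102, 30, 4831, "Utrecht"), (102, 40, 4830, "Rotterdam"),
        (102, 50, 4790, "Rotterdam"), (102, 60, 4840, "Utrecht"),
        (102, 70, 4810, "Amsterdam")]
con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

# Innermost: row-number difference tags consecutive same-Place runs.
# Middle: MIN(Sequence) per run makes the tag unique across places.
# Outer: DENSE_RANK turns those tags into 1, 2, 3, ... group numbers.
sql = """
SELECT Trip, Place, Sequence,
       DENSE_RANK() OVER (ORDER BY grp) AS grp_no
FROM (
  SELECT Trip, Place, Sequence,
         MIN(Sequence) OVER (PARTITION BY Trip, Place, grp) AS grp
  FROM (
    SELECT Trip, Place, Sequence,
           ROW_NUMBER() OVER (ORDER BY Sequence) -
           ROW_NUMBER() OVER (PARTITION BY Trip, Place ORDER BY Sequence) AS grp
    FROM mytable) AS t) AS x
ORDER BY Sequence
"""
groups = [r[3] for r in con.execute(sql)]
```

The original seven rows get group numbers 1, 2, 2, 3, 3, 4, 5 in sequence order, i.e. the five groups from the expected output assigned back to every row.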

window function in redshift

I have some data that looks like this:
CustID EventID TimeStamp
1 17 1/1/15 13:23
1 17 1/1/15 14:32
1 13 1/1/15 14:54
1 13 1/3/15 1:34
1 17 1/5/15 2:54
1 1 1/5/15 3:00
2 17 2/5/15 9:12
2 17 2/5/15 9:18
2 1 2/5/15 10:02
2 13 2/8/15 7:43
2 13 2/8/15 7:50
2 1 2/8/15 8:00
I'm trying to use the row_number function to get it to look like this:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 1
1 13 1/1/15 14:54 2
1 13 1/3/15 1:34 2
1 17 1/5/15 2:54 3
1 1 1/5/15 3:00 4
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 1
2 1 2/5/15 10:02 2
2 13 2/8/15 7:43 3
2 13 2/8/15 7:50 3
2 1 2/8/15 8:00 4
I tried this:
row_number() over
  (partition by custID, EventID
   order by custID, TimeStamp asc) SeqNum
but got this back:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 2
1 13 1/1/15 14:54 3
1 13 1/3/15 1:34 4
1 17 1/5/15 2:54 5
1 1 1/5/15 3:00 6
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 2
2 1 2/5/15 10:02 3
2 13 2/8/15 7:43 4
2 13 2/8/15 7:50 5
2 1 2/8/15 8:00 6
How can I get it to sequence based on the change in the EventID?
This is tricky. You need a multi-step process. You need to identify the groups (a difference of row_number() works for this). Then, assign an increasing constant to each group. And then use dense_rank():
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
min(timestamp) over (partition by custid, eventid, grp) as mints
from (select sd.*,
(row_number() over (partition by custid order by timestamp) -
row_number() over (partition by custid, eventid order by timestamp)
) as grp
from somedata sd
) sd
) sd;
Another method is to use lag() and a cumulative sum:
select sd.*,
sum(case when prev_eventid is null or prev_eventid <> eventid
then 1 else 0 end) over (partition by custid order by timestamp
) as seqnum
from (select sd.*,
lag(eventid) over (partition by custid order by timestamp) as prev_eventid
from somedata sd
) sd;
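The lag() plus cumulative-sum method translates directly to SQLite; here is a small sketch via Python's sqlite3 using the first customer's rows from the question (with the column named ts rather than TimeStamp, and timestamps as sortable ISO strings):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE somedata (custid INT, eventid INT, ts TEXT)")
rows = [(1, 17, "2015-01-01 13:23"), (1, 17, "2015-01-01 14:32"),
        (1, 13, "2015-01-01 14:54"), (1, 13, "2015-01-03 01:34"),
        (1, 17, "2015-01-05 02:54"), (1, 1, "2015-01-05 03:00")]
con.executemany("INSERT INTO somedata VALUES (?, ?, ?)", rows)

# A row starts a new sequence exactly when its eventid differs from
# the previous row's (or there is no previous row); the running sum
# of those 1s is the sequence number.
sql = """
select sd.custid, sd.eventid,
       sum(case when prev_eventid is null or prev_eventid <> eventid
                then 1 else 0 end) over (partition by custid order by ts) as seqnum
from (select sd.*,
             lag(eventid) over (partition by custid order by ts) as prev_eventid
      from somedata sd
     ) sd
order by sd.custid, sd.ts
"""
seqnums = [r[2] for r in con.execute(sql)]
```

For customer 1 the sequence numbers come out 1, 1, 2, 2, 3, 4, matching the desired SeqNum column.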
EDIT:
The last time I used Amazon Redshift it didn't have row_number(). You can do:
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
min(timestamp) over (partition by custid, eventid, grp) as mints
from (select sd.*,
(row_number() over (partition by custid order by timestamp rows between unbounded preceding and current row) -
row_number() over (partition by custid, eventid order by timestamp rows between unbounded preceding and current row)
) as grp
from somedata sd
) sd
) sd;
Try this code block:
WITH by_day AS (
  SELECT *,
         ts::date AS login_day
  FROM table_name
)
SELECT *,
       FIRST_VALUE(login_day) OVER (PARTITION BY userid ORDER BY login_day ROWS UNBOUNDED PRECEDING) AS first_day
FROM by_day