Find the longest streak of perfect scores per player - sql

I have the following result from a SELECT query with ORDER BY player_id ASC, time ASC in a PostgreSQL database:
player_id points time
395 0 2018-06-01 17:55:23.982413-04
395 100 2018-06-30 11:05:21.8679-04
395 0 2018-07-15 21:56:25.420837-04
395 100 2018-07-28 19:47:13.84652-04
395 0 2018-11-27 17:09:59.384-05
395 100 2018-12-02 08:56:06.83033-05
399 0 2018-05-15 15:28:22.782945-04
399 100 2018-06-10 12:11:18.041521-04
454 0 2018-07-10 18:53:24.236363-04
675 0 2018-08-07 20:59:15.510936-04
696 0 2018-08-07 19:09:07.126876-04
756 100 2018-08-15 08:21:11.300871-04
756 100 2018-08-15 16:43:08.698862-04
756 0 2018-08-15 17:22:49.755721-04
756 100 2018-10-07 15:30:49.27374-04
756 0 2018-10-07 15:35:00.975252-04
756 0 2018-11-27 19:04:06.456982-05
756 100 2018-12-02 19:24:20.880022-05
756 100 2018-12-04 19:57:48.961111-05
I'm trying to find each player's longest streak where points = 100, with the tiebreaker being whichever streak began most recently. I also need to determine the time at which that player's longest streak began. The expected result would be:
player_id longest_streak time_began
395 1 2018-12-02 08:56:06.83033-05
399 1 2018-06-10 12:11:18.041521-04
756 2 2018-12-02 19:24:20.880022-05

A gaps-and-islands problem indeed.
Assuming:
"Streaks" are not interrupted by rows from other players.
All columns are defined NOT NULL. (Else you have to do more.)
This should be the simplest and fastest approach, as it only needs two fast row_number() window functions:
SELECT DISTINCT ON (player_id)
player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
SELECT player_id, points, ts
, row_number() OVER (PARTITION BY player_id ORDER BY ts)
- row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp -- omit "points" after WHERE points = 100
ORDER BY player_id, seq_len DESC, time_began DESC;
This uses the column name ts instead of time, which is a reserved word in standard SQL. It's allowed in Postgres, but with limitations, and it's still a bad idea to use it as an identifier.
The "trick" is to subtract row numbers so that consecutive rows fall in the same group (grp) per (player_id, points). Then filter the ones with 100 points, aggregate per group and return only the longest, most recent result per player.
Basic explanation for the technique: see "Select longest continuous sequence".
We can use GROUP BY and DISTINCT ON in the same SELECT; GROUP BY is applied before DISTINCT ON. For the sequence of events in a SELECT query, see "Best way to get result count before LIMIT was applied".
About DISTINCT ON, see "Select first row in each GROUP BY group?".
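A minimal setup to try the query above (a sketch; table and column names as assumed above, and the rows for players 454, 675 and 696 are omitted since they never score 100 and are filtered out anyway):
CREATE TABLE tbl (
  player_id int         NOT NULL
, points    int         NOT NULL
, ts        timestamptz NOT NULL
);

INSERT INTO tbl (player_id, points, ts) VALUES
  (395, 0,   '2018-06-01 17:55:23.982413-04')
, (395, 100, '2018-06-30 11:05:21.8679-04')
, (395, 0,   '2018-07-15 21:56:25.420837-04')
, (395, 100, '2018-07-28 19:47:13.84652-04')
, (395, 0,   '2018-11-27 17:09:59.384-05')
, (395, 100, '2018-12-02 08:56:06.83033-05')
, (399, 0,   '2018-05-15 15:28:22.782945-04')
, (399, 100, '2018-06-10 12:11:18.041521-04')
, (756, 100, '2018-08-15 08:21:11.300871-04')
, (756, 100, '2018-08-15 16:43:08.698862-04')
, (756, 0,   '2018-08-15 17:22:49.755721-04')
, (756, 100, '2018-10-07 15:30:49.27374-04')
, (756, 0,   '2018-10-07 15:35:00.975252-04')
, (756, 0,   '2018-11-27 19:04:06.456982-05')
, (756, 100, '2018-12-02 19:24:20.880022-05')
, (756, 100, '2018-12-04 19:57:48.961111-05');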

This is a gaps-and-islands problem. You can use a conditional SUM aggregate as a window function to compute a gap number,
then use the MAX and COUNT window functions on top of it.
Query 1:
WITH CTE AS (
  SELECT *,
         SUM(CASE WHEN points = 100 THEN 1 END) OVER(PARTITION BY player_id ORDER BY time) -
         SUM(1) OVER(PARTITION BY player_id ORDER BY time) RN
  FROM T
)
SELECT player_id,
       MAX(streak_end) streak_end,
       MAX(cnt) longest_streak
FROM (
  SELECT player_id,
         MAX(time) OVER(PARTITION BY rn, player_id) streak_end,
         COUNT(*) OVER(PARTITION BY rn, player_id) cnt
  FROM CTE
  WHERE points > 0
) t1
GROUP BY player_id
Results:
| player_id | streak_end                  | longest_streak |
|-----------|-----------------------------|----------------|
| 756       | 2018-12-04T19:57:48.961111Z | 2              |
| 399       | 2018-06-10T12:11:18.041521Z | 1              |
| 395       | 2018-12-02T08:56:06.83033Z  | 1              |

One way to do this is to look at how many rows fall between the previous and next non-100 results. To get the lengths of the streaks:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
)
select s.*,
coalesce(next_seqnum, cnt + 1) - coalesce(prev_seqnum, 0) - 1 as length
from (select s.*,
max(seqnum) filter (where score <> 100) over (partition by player_id order by time) as prev_seqnum,
min(seqnum) filter (where score <> 100) over (partition by player_id order by time rows between current row and unbounded following) as next_seqnum
from s
) s
where score = 100;
You can then incorporate the other conditions:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
),
streaks as (
select s.*,
next_seqnum - prev_seqnum - 1 as length,
max(next_seqnum - prev_seqnum - 1) over (partition by player_id) as max_length,
max(next_seqnum) over (partition by player_id) as max_next_seqnum
from (select s.*,
coalesce(max(seqnum) filter (where score <> 100) over (partition by player_id order by time), 0) as prev_seqnum,
coalesce(min(seqnum) filter (where score <> 100) over (partition by player_id order by time rows between current row and unbounded following), cnt + 1) as next_seqnum
from s
) s
where score = 100
)
select s.*
from streaks s
where length = max_length and
next_seqnum = max_next_seqnum;
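To collapse this to one row per player (the length and start time of the winning streak, matching the expected result), you could aggregate the filtered rows. A sketch reusing the streaks CTE above:
select player_id,
       max_length as longest_streak,
       min(time) as time_began
from streaks
where length = max_length and
      next_seqnum = max_next_seqnum
group by player_id, max_length;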

Here is my answer:
select user_id,
       max(strk) strk,
       max(time) time
from (
  select
    user_id,
    non_streak,
    streak,
    ifnull(non_streak, streak) strk,
    max(time) time
  from (
    select
      user_id, time, points,
      lag(points) over (partition by user_id order by time) prev_point,
      case when points + lag(points) over (partition by user_id order by time) = 100 then 1 end as non_streak,
      case when points + lag(points) over (partition by user_id order by time) > 100 then 1 end as streak
    from players
  ) p
  where ifnull(non_streak, streak) is not null
  group by 1, 2, 3
) g
group by user_id

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, see the sample input data and the merged output shown further down. (The original post illustrated them with screenshots.)
I have tried LAG/LEAD and unbounded/bounded preceding windows, but I am unable to achieve the output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consecutive rows having the same ID and Badge.
Help will be appreciated. Thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with plain aggregate functions (MIN and MAX)
You can approach the first step using a boolean field that is 1 when there is no consecutive-date match, i.e. when the previous row's ENDDATE + 1 day does not equal the current row's STARTDATE. This value indicates when a new partition should begin; hence if you compute a running sum of it, you get your correctly numbered partitions.
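For instance, taking the ID = 51, Badge = 2 rows from the sample data at the end of this post:
STARTDATE   ENDDATE     LAG(ENDDATE) + 1 day  boolval
2019-04-29  2019-04-28  (null)                1   -- first row of its partition
2019-09-16  2019-11-16  2019-04-29            1   -- gap: starts a new group
2019-11-17  2020-08-16  2019-11-17            0   -- consecutive: same group
The running sum of boolval then assigns the last two rows the same partition number.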
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 DAY' = STARTDATE, 0, 1) AS boolval
FROM tab
)
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
The second step can then be summarized as direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX functions respectively, grouping on the computed ranking value. For syntactic correctness, you need to add "ID" and "Badge" to the GROUP BY clause too, even though their range of action is already captured by the ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 DAY' = STARTDATE, 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
In Snowflake, such gaps-and-islands problems can be solved using the function CONDITIONAL_TRUE_EVENT, as in the query below.
The first CTE creates a column to indicate a change event (true or false) whenever the value of column badge changes.
The next CTE (cte_1) feeds this change-event column to CONDITIONAL_TRUE_EVENT, which produces another column (incremented whenever the change flag is TRUE) to be used for grouping in the final main query.
The final query is just MIN, MAX, and GROUP BY.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by startdate)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by startdate) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID  MS          ME          BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  9999-12-31  3
With data -
select * from merge_tab;
ID  STARTDATE   ENDDATE     BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2019-04-28  2
51  2019-09-16  2019-11-16  2
51  2019-11-17  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-05-05  2
51  2022-05-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  2019-04-28  3
10  2021-03-21  9999-12-31  3

Efficient way to get the average of past x events within d days per each row in SQL (big data)

I want to find the most efficient way to calculate, for each row, the average score of the past 2 events within the last 7 days.
I already have a query that works on 60M rows, but on 100% of the data (~500M rows) it collapses (maybe it's inefficient, or maybe resources run out).
Can you help? If you think my solution is not the best way, please explain.
Thank you
I have this table:
user_id event_id start end score
---------------------------------------------------
1 7 30/01/2021 30/01/2021 45
1 6 24/01/2021 29/01/2021 25
1 5 22/01/2021 23/01/2021 13
1 4 18/01/2021 21/01/2021 15
1 3 17/01/2021 17/01/2021 52
1 2 08/01/2021 10/01/2021 8
1 1 01/01/2021 02/01/2021 36
I want, per row (user_id + event_id), the average score of the past 2 events in the last 7 days.
For example, for this row:
user_id event_id start end score
---------------------------------------------------
1 6 24/01/2021 29/01/2021 25
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 null null
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
1 3 17/01/2021 17/01/2021 52 yes 3
1 2 08/01/2021 10/01/2021 8 no 4
1 1 01/01/2021 02/01/2021 36 no 5
so I would select only these rows for the group by and then avg(score):
user_id event_id start end score past_7_days_from_start event_num
--------------------------------------------------------------------------------------
1 5 22/01/2021 23/01/2021 13 yes 1
1 4 18/01/2021 21/01/2021 15 yes 2
Result:
user_id event_id start end score avg_score_of_past_2_events_within_7_days
--------------------------------------------------------------------------------------
1 6 24/01/2021 29/01/2021 25 14
My query:
SELECT user_id, event_id, AVG(score) AS avg_score_of_past_2_events_within_7_days
FROM (
  SELECT
    B.user_id, B.event_id, A.score,
    ROW_NUMBER() OVER (PARTITION BY B.user_id, B.event_id ORDER BY A.end DESC) AS event_num
  FROM
    "df" A
  INNER JOIN
    (SELECT user_id, event_id, start FROM "df") B
  ON A.user_id = B.user_id
  AND (A.end BETWEEN DATE_SUB(B.start, INTERVAL 7 DAY) AND B.start)
) t
WHERE event_num <= 2
GROUP BY user_id, event_id
Any suggestion for a better way?
I don't believe there is a more efficient query in your case.
I can suggest you do the following:
Make sure your base table is partitioned by start and clustered by user_id.
Split the query into 3 parts, each creating a partitioned and clustered table (see the sketch after this list):
first table: only the inner join, O(n^2)
second table: add ROW_NUMBER, O(n)
third table: the group by
If it is still a problem, I would suggest doing batch preprocessing and running the queries by dates.
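A minimal BigQuery sketch of that suggestion (dataset and table names are assumed; start must be a DATE column for this PARTITION BY form):
-- materialize the base table partitioned by start and clustered by user_id
CREATE TABLE mydataset.df_part
PARTITION BY start          -- use DATE(start) instead if start is a TIMESTAMP
CLUSTER BY user_id AS
SELECT * FROM mydataset.df;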
I tried to create a solution using LEAD functions, but I am not able to test whether it works on such a large dataset.
I fetch the two preceding rows as prev and ante using LEAD.
Then an IF checks the 7-day window; when it matches I populate scorePP and scoreAA, otherwise they are null.
with t as (
select 1 as user_id,7 as event_id,parse_date('%d/%m/%Y','30/01/2021') as start,parse_date('%d/%m/%Y','30/01/2021') as stop, 45 as score union all
select 1 as user_id,6 as event_id,parse_date('%d/%m/%Y','24/01/2021') as start,parse_date('%d/%m/%Y','29/01/2021') as stop, 25 as score union all
select 1 as user_id,5 as event_id,parse_date('%d/%m/%Y','22/01/2021') as start,parse_date('%d/%m/%Y','23/01/2021') as stop, 13 as score union all
select 1 as user_id,4 as event_id,parse_date('%d/%m/%Y','18/01/2021') as start,parse_date('%d/%m/%Y','21/01/2021') as stop, 15 as score union all
select 1 as user_id,3 as event_id,parse_date('%d/%m/%Y','17/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 1 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 1 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score union all
select 2 as user_id,3 as event_id,parse_date('%d/%m/%Y','12/01/2021') as start,parse_date('%d/%m/%Y','17/01/2021') as stop, 52 as score union all
select 2 as user_id,2 as event_id,parse_date('%d/%m/%Y','08/01/2021') as start,parse_date('%d/%m/%Y','10/01/2021') as stop, 8 as score union all
select 2 as user_id,1 as event_id,parse_date('%d/%m/%Y','01/01/2021') as start,parse_date('%d/%m/%Y','02/01/2021') as stop, 36 as score
)
select *, (select avg(x) from unnest([scorePP,scoreAA]) as x) as avg_score_7_day from (
SELECT
t.*,
lead(start,1) over(partition by user_id order by event_id desc, t.stop desc) prev_start,
lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc) prev_stop,
lead(score,1) over(partition by user_id order by event_id desc, t.stop desc) prev_score,
if(((lead(start,1) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,1) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,1) over(partition by user_id order by event_id desc, t.stop desc),null) as scorePP,
/**/
lead(start,2) over(partition by user_id order by event_id desc, t.stop desc) ante_start,
lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc) ante_stop,
lead(score,2) over(partition by user_id order by event_id desc, t.stop desc) ante_score,
if(((lead(start,2) over(partition by user_id order by event_id desc, t.stop desc)) between date_sub(start, interval 7 day) and (lead(stop,2) over(partition by user_id order by event_id desc, t.stop desc))),lead(score,2) over(partition by user_id order by event_id desc, t.stop desc),null) as scoreAA
from
t
)
where coalesce(scorePP,scoreAA) is not null
order by user_id,event_id desc
Consider the approach below:
select * except(candidates1, candidates2),
( select avg(score)
from (
select * from unnest(candidates1) union distinct
select * from unnest(candidates2)
order by event_id desc
limit 2
)
) as avg_score_of_past_2_events_within_7_days
from (
select *,
array_agg(struct(event_id, score)) over(partition by user_id order by unix_date(t.start) range between 7 preceding and 1 preceding) as candidates1,
array_agg(struct(event_id, score)) over(partition by user_id order by unix_date(t.end) range between 7 preceding and 1 preceding) as candidates2
from your_table t
)
If applied to the sample data in your question, the output is: (output screenshot omitted)

Calculate the streaks of visits of users, limited to 7

I am trying to calculate the consecutive visits a user makes in an app. I used the RANK function to determine the streaks maintained by each user. However, my requirement is that a streak should not exceed 7.
For instance, if a user visits the app for 9 consecutive days, he will have 2 different streaks: one with count 7 and the other with count 2.
I'm using MaxCompute; it's similar to MySQL.
I have the following table named visitors_data:
user_id visit_date
murtaza 01-01-2021
john 01-01-2021
murtaza 02-01-2021
murtaza 03-01-2021
murtaza 04-01-2021
john 01-01-2021
murtaza 05-01-2021
murtaza 06-01-2021
john 02-01-2021
john 03-01-2021
murtaza 07-01-2021
murtaza 08-01-2021
murtaza 09-01-2021
john 20-01-2021
john 21-01-2021
Output should look like this:
user_id streak
murtaza 7
murtaza 2
john 3
john 2
I was able to get the streaks with the following query, but I could not limit them to 7.
WITH groups AS (
SELECT user_id,
RANK() OVER (ORDER BY user_id, visit_date) AS RANK,
visit_date,
DATEADD(visit_date, -RANK() OVER (ORDER BY user_id, visit_date), 'dd') AS date_group
FROM visitors_data
ORDER BY user_id, visit_date)
SELECT
user_id,
COUNT(*) AS streak
FROM groups
GROUP BY
user_id,
date_group
HAVING COUNT(*)>1
ORDER BY COUNT(*);
My thinking ran along similar lines to forpas':
SELECT user_id, COUNT(*) streak
FROM
(
SELECT
user_id, streak,
FLOOR((ROW_NUMBER() OVER (PARTITION BY user_id, streak ORDER BY visit_date)-1)/7) substreak
FROM
(
SELECT
user_id, visit_date,
SUM(runtot) OVER (PARTITION BY user_id ORDER BY visit_date) streak
FROM (
SELECT
user_id, visit_date,
CASE WHEN DATE_ADD(visit_date, INTERVAL -1 DAY) = LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date) THEN 0 ELSE 1 END as runtot
FROM visitors_data
GROUP BY user_id, visit_date
) x
) y
) z
GROUP BY user_id, streak, substreak
As an explanation of how this works: a usual trick for counting runs of successive records is to use LAG to examine the preceding record and, if there is only e.g. one day's difference, put a 0, otherwise put a 1. The first record of a consecutive run is then 1 and the rest are 0, so the column ends up looking like 1,0,0,0,1,0... SUM OVER ORDER BY sums this in a "running total" fashion. This effectively forms a counter that ticks up every time the start of a run is encountered, so a run of 4 days followed by a gap and then a run of 3 days looks like 1,1,1,1,2,2,2, and it forms a "streak ID number".
If this is then fed into a row numbering that partitions by the streak ID number, it establishes an incrementing counter that restarts every time the streak ID changes. If we subtract 1 from this so it runs from 0 instead of 1, we can divide it by 7 to get a "sub-streak ID"; for our 9-long streak that is 0,0,0,0,0,0,0,1,1 (and so on; a streak of 25 would have 7 zeroes, 7 ones, 7 twos, and 4 threes).
All that remains is to group by the user, the streak ID, and the sub-streak ID, and count the result.
Before the final group and count, the data looks like this (reconstructed below from the sample data; the original answer showed a screenshot):
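user_id  visit_date  runtot  streak  substreak
john     01-01-2021  1       1       0
john     02-01-2021  0       1       0
john     03-01-2021  0       1       0
john     20-01-2021  1       2       0
john     21-01-2021  0       2       0
murtaza  01-01-2021  1       1       0
murtaza  02-01-2021  0       1       0
murtaza  03-01-2021  0       1       0
murtaza  04-01-2021  0       1       0
murtaza  05-01-2021  0       1       0
murtaza  06-01-2021  0       1       0
murtaza  07-01-2021  0       1       0
murtaza  08-01-2021  0       1       1
murtaza  09-01-2021  0       1       1
(john's duplicate 01-01-2021 row is removed by the inner GROUP BY)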
That should give some idea of how it all works.
With a mix of window functions and aggregation:
SELECT user_id, COALESCE(NULLIF(MAX(counter) % 7, 0), 7) streak
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY user_id, grp ORDER BY visit_date) counter
FROM (
SELECT *, SUM(flag) OVER (PARTITION BY user_id ORDER BY visit_date) grp
FROM (
SELECT *, COALESCE(DATE_ADD(visit_date, INTERVAL -1 DAY) <>
LAG(visit_date) OVER (PARTITION BY user_id ORDER BY visit_date), 1) flag
FROM (SELECT DISTINCT * FROM visitors_data) t
) t
) t
) t
GROUP BY user_id, grp, FLOOR((counter - 1) / 7)
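To see why the COALESCE(NULLIF(MAX(counter) % 7, 0), 7) expression works, take a 9-day run: the inner counter runs 1..9 and FLOOR((counter - 1) / 7) splits it into two groups. The first group has MAX(counter) = 7, and 7 % 7 = 0, which NULLIF turns into NULL and COALESCE maps back to 7; the second group has MAX(counter) = 9, and 9 % 7 = 2.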
You could break them up after the fact. For instance, if you never have more than 28:
SELECT user_id, LEAST(streak - (n - 1) * 7, 7) AS streak
FROM (SELECT user_id, COUNT(*) AS streak
      FROM groups
      GROUP BY user_id, date_group
      HAVING COUNT(*) > 1
     ) gu JOIN
     (SELECT 1 as n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
     ) n
     ON streak > (n - 1) * 7
ORDER BY LEAST(streak - (n - 1) * 7, 7);
If the longest streak can be arbitrarily large, you can do something similar with a recursive CTE.
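A sketch of that recursive variant (assuming MySQL 8+ syntax; groups is the CTE from the question and would be defined alongside):
WITH RECURSIVE pieces AS (
  -- anchor: one row per full streak, with the first piece capped at 7
  SELECT user_id, LEAST(streak, 7) AS piece, streak - 7 AS remaining
  FROM (SELECT user_id, COUNT(*) AS streak
        FROM groups
        GROUP BY user_id, date_group
        HAVING COUNT(*) > 1
       ) gu
  UNION ALL
  -- peel off another piece of at most 7 while anything remains
  SELECT user_id, LEAST(remaining, 7), remaining - 7
  FROM pieces
  WHERE remaining > 0
)
SELECT user_id, piece AS streak
FROM pieces
ORDER BY user_id, piece DESC;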

Teradata SQL - Operation with max-min dates

Suppose I have the following table in a Teradata SQL database.
How can I get the variation in price between the lowest and highest date, per user? Regards
Initial table
user date price
1 1-1 10
1 2-1 20
1 3-1 30
2 1-1 12
2 2-1 22
2 3-1 32
3 1-1 13
3 2-1 23
3 3-1 33
Final table
user var_price
1 30/10-1
2 32/12-1
3 33/13-1
Try this-
SELECT B.[user],
CAST(SUM(B.max_price) AS VARCHAR)+'/'+CAST(SUM(B.min_price) AS VARCHAR)+ '-1' var_price,
SUM(B.max_price)/SUM(B.min_price) -1 calculated_var_price
FROM
(
SELECT * FROM
(
SELECT [user],0 max_price,price min_price,ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE) RN
FROM your_table
)A WHERE RN = 1
UNION ALL
SELECT * FROM
(
SELECT [user],price max_price,0 min_price, ROW_NUMBER() OVER (PARTITION BY [user] ORDER BY DATE DESC) RN
FROM your_table
)A WHERE RN = 1
)B
GROUP BY B.[user]
Output is-
user var_price calculated_var_price
1 30/10-1 2
2 32/12-1 1
3 33/13-1 1
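Note that SUM(B.max_price) / SUM(B.min_price) - 1 is integer division here (the bracket-quoted syntax suggests SQL Server), so calculated_var_price is truncated: 32/12 - 1 yields 1, not ~1.67. If you want the fractional variation, cast one operand first, e.g.:
CAST(SUM(B.max_price) AS FLOAT) / SUM(B.min_price) - 1 calculated_var_price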
Is this what you want?
select user, max(price) / min(price) - 1
from t
group by user;
Your values are monotonically increasing, so max() and min() seem like the simplest solution.
EDIT:
You can use window functions:
select user, max(last_price) / max(first_price) - 1
from (select t.*,
first_value(price) over (partition by user order by date rows between unbounded preceding and current row) as first_price,
first_value(price) over (partition by user order by date desc rows between unbounded preceding and current row) as last_price
from t
) t
group by user;
select user
,price as first_price
,last_value(price)
over (partition by user
order by date
rows between unbounded preceding and unbounded following) as last_price
from mytab
qualify
row_number() -- lowest date only
over (partition by user
order by date) = 1
This returns the row with the lowest date and adds the price of the latest date.

SQL partition by on date range

Assume this is my table:
ID NUMBER DATE
------------------------
1 45 2018-01-01
2 45 2018-01-02
3 45 2018-01-27
I need to separate them using partition by and row_number, starting a new group where the difference between one date and the next is greater than 5 days. Something like this would be the result for the above example:
ROWNUMBER ID NUMBER DATE
-----------------------------
1 1 45 2018-01-01
2 2 45 2018-01-02
1 3 45 2018-01-27
My actual query is something like this:
SELECT ROW_NUMBER() OVER(PARTITION BY NUMBER ORDER BY ID DESC) AS ROWNUMBER, ...
But as you can notice, it doesn't work for the dates. How can I achieve that?
You can use the lag function:
select *, row_number() over (partition by number, grp order by id) as [ROWNUMBER]
from (select *, (case when datediff(day, lag(date,1,date) over (partition by number order by id), date) <= 5
then 1 else 2
end) as grp
from table
) t;
Using the lag and datediff functions:
select * from
(
select t.*,
datediff(day,
lag(DATE) over (partition by NUMBER order by id),
DATE
) as diff
from t
) as TT where diff>5
http://sqlfiddle.com/#!18/130ae/11
I think you want to identify the groups using lag(), datediff(), and a cumulative sum. Then use row_number():
select t.*,
row_number() over (partition by number, grp order by date) as rownumber
from (select t.*,
sum(grp_start) over (partition by number order by date) as grp
from (select t.*,
(case when lag(date) over (partition by number order by date) < dateadd(day, -5, date)
then 1 else 0
end) as grp_start
from t
) t
) t;
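Applied to the sample table above, the intermediate columns would look like this (grp_start flags a row that starts more than 5 days after its predecessor):
ID  NUMBER  DATE        grp_start  grp  rownumber
1   45      2018-01-01  0          0    1
2   45      2018-01-02  0          0    2
3   45      2018-01-27  1          1    1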